Abstract

We analyze active learning algorithms, which only receive the classifications of examples when they ask for them, 
and traditional passive (PAC) learning algorithms, which receive classifications for all training examples, under log- 
concave and nearly log-concave distributions. We prove that active learning provides an exponential improvement 
over passive learning when learning homogeneous linear separators in these settings, answering an open question 
in [6]. For passive learning, we provide a computationally efficient algorithm with optimal sample complexity for
such problems; this provides the first positive answer to a longstanding open question of Ehrenfeucht et al. [22] and
Blumer et al. [10]. We also provide new bounds for active and passive learning in the case that the data might not be
linearly separable, both in the agnostic case and under the Tsybakov low-noise condition. To derive our results,
we provide new structural results for (nearly) log-concave distributions, which might be of independent interest as 
well. 
Learning linear separators is one of the central challenges in machine learning. Linear separators are widely used and have
long been studied in both statistical and computational learning theory. A seminal result of Blumer et al. [10], using
tools due to Vapnik and Chervonenkis [47], showed that $d$-dimensional linear separators can be learned to accuracy
$1 - \epsilon$ with probability $1 - \delta$ in the classic PAC model in polynomial time with $O((d/\epsilon) \log(1/\epsilon) + (1/\epsilon) \log(1/\delta))$
examples. The best known lower bound for linear separators is $\Omega(d/\epsilon + (1/\epsilon) \log(1/\delta))$, and this holds even in the
case in which the distribution is uniform in the unit ball [36]. Whether the upper bound can be improved to match the
lower bound via a polynomial time algorithm has been a long-standing open question [22, 10]. In this work we resolve
this question in the case where the underlying distribution belongs to the class of log-concave and nearly log-concave
distributions, a wide class of distributions that includes the uniform distribution over any convex set and which has
played an important role in several areas including sampling, optimization, integration, and learning [38].

We also consider active learning, a major area of research of modern machine learning, where the algorithm only 
receives the classifications of examples when it requests them [18]. Our main result here is a polynomial time active
learning algorithm with label complexity that is exponentially better than the label complexity of any passive learning
algorithm in these settings. This answers an open question in [6] and it also significantly expands the set of cases for
which we can show that active learning provides a clear exponential improvement over passive learning. 

We also study active and passive learning in the case that the data might not be linearly separable. We specifically 
provide new bounds for the widely studied Tsybakov low-noise condition [39, 8, 40, 21, 33], as well as new bounds
on the disagreement coefficient, with implications for the agnostic case (i.e., arbitrary forms of noise). 

Passive Learning In the classic passive supervised machine learning setting, the learning algorithm is given a set of 
labeled examples drawn i.i.d. from some fixed but unknown distribution over the instance space and labeled according 
to some fixed but unknown target function, and the goal is to output a classifier that does well on new examples coming 
from the same distribution. This setting has been long studied in both statistical [4-8 , 49JQT) and computational learning
theory B41I3T1F1 and has played a crucial role in the developments and successes of machine learning. 

However, despite remarkable progress, the basic question of providing polynomial time algorithms with tight bounds
on the sample complexity has remained open. Several milestone results along these lines that are especially related to
our work include the following. The analysis of Blumer et al. [10], proved using tools from [47], implies that linear
separators can be learned in polynomial time with $O((d/\epsilon) \log(1/\epsilon) + (1/\epsilon) \log(1/\delta))$ labeled examples. Ehrenfeucht
et al. [22] proved a bound that implies an $\Omega(d/\epsilon + (1/\epsilon) \log(1/\delta))$ lower bound for linear separators and explicitly
posed the question of providing tight bounds for this class. Haussler, Littlestone, and Warmuth [29] established an
upper bound of $O((d/\epsilon) \log(1/\delta))$, which can be achieved in polynomial time for linear separators.

Blumer et al. [10] achieved polynomial-time learning by finding a consistent hypothesis (i.e., a hypothesis which
correctly classifies all training examples); this is a special case of ERM [38]. An intensive line of research in the
empirical process and statistical learning theory literature has taken account of "local complexity" to prove stronger 
bounds for ERM (see J46] |45j El [37] SI] [21] |26] [28) ) . In the context of learning, local complexity takes account of the 
fact that really bad classifiers can be easily discarded, and the set of "local" classifiers that are harder to disqualify is 
sometimes not as rich. A recent landmark result of Giné and Koltchinskii [21] (see [43, 28]) is the bound for consistent
algorithms of

$O((d/\epsilon) \log(\mathrm{cap}(\epsilon)) + (1/\epsilon) \log(1/\delta))$ (1)

where $\mathrm{cap}(\epsilon)$ is the Alexander capacity, which depends on the distribution [1] (see "Related Work" section for further
discussion). However, this bound can be suboptimal for linear separators.

In particular, for linear separators in the case in which the underlying distribution is uniform in the unit ball, the sample
complexity is known [36, 37] to be $\Theta((d + \log(1/\delta))/\epsilon)$ when computational considerations are ignored. Bshouty et al. [12],
using the doubling dimension [4], another measure of local complexity, proved a bound of

$O((d/\epsilon) \sqrt{\log(1/\epsilon)} + (1/\epsilon) \log(1/\delta))$ (2)

for a polynomial-time algorithm. As a lower bound of $\Omega(\sqrt{d})$ on $\mathrm{cap}(\epsilon)$ for $\epsilon = O(1/\sqrt{d})$ for the case of linear
separators and the uniform distribution is implicit in [26], the bound of Giné and Koltchinskii [21] given by (1) cannot
yield a bound better than

$O((d/\epsilon) \min\{\log d, \log(1/\epsilon)\} + (1/\epsilon) \log(1/\delta))$ (3)

in this case.

In this paper we provide a tight bound (up to constant factors) on the sample complexity of polynomial-time learning of
linear separators with respect to log-concave distributions. Specifically, we prove an $O((d + \log(1/\delta))/\epsilon)$ upper bound using
a polynomial-time algorithm that holds for any zero-mean log-concave distribution. We also prove an information
theoretic lower bound that matches our (computationally efficient) upper bound for each log-concave distribution. This
provides the first bound for a polynomial-time algorithm that is tight for an interesting non-finite class of hypothesis
functions under a general class of data-distributions, and also characterizes (up to a constant factor) the distribution-specific
sample complexity for each distribution in the class. In the special case of the uniform distribution, our upper
bound closes the existing $\Omega(\min\{\sqrt{\log(1/\epsilon)}, \log(d)\})$ gap between the upper bounds (2) and (3) and the lower bound
of [36].

Active Learning We also study learning of linear separators in the active learning model; here the learning algorithm
can access unlabeled (i.e., unclassified) examples and ask for labels of unlabeled examples of its own choice, and
the hope is that a good classifier can be learned with significantly fewer labels by actively directing the queries to
informative examples. This has been a major area of machine learning research in the past fifteen years, mainly due to
the availability of large amounts of unannotated or raw data in many modern applications [18], with many exciting
developments on understanding its underlying principles as well [23, 17, 5, 6, 26, 19, 14, 7, 33, 9]. However, with
a few exceptions [6, 14, 20], most of the theoretical developments have focused on the so-called disagreement-based
active learning paradigm [27, 33]; methods and analyses developed in this context are often suboptimal, as they take a

1 The PAC model [44] is a formalization of this setting.
conservative approach and consider strategies that query even points on which there is a small amount of uncertainty
(or disagreement) among the classifiers still under consideration given the labels queried so far. The results derived
in this manner often show an improvement in the $1/\epsilon$ factor in the label complexity of active versus passive learning;
however, unfortunately, the dependence on the $d$ term typically gets worse.

By analyzing a more aggressive, margin-based active learning algorithm, we prove that we can efficiently learn
homogeneous linear separators when the underlying distribution is log-concave by using only
$O((d + \log(1/\delta) + \log\log(1/\epsilon)) \log(1/\epsilon))$ label requests, answering an open question in [6]. This represents an exponential
improvement of active learning over passive learning and it significantly broadens the cases for which we can show that the
dependence on $1/\epsilon$ in passive learning can be improved to only $O(\log(1/\epsilon))$ in active learning, but without increasing
the dependence on the dimension $d$. We note that an improvement of this type was known to be possible only for the
case when the underlying distribution is (nearly) uniform in the unit ball [6, 20, 17, 23]; even for this special case,
our analysis improves the results of [6] by a multiplicative $\log\log(1/\epsilon)$ factor; it also provides better dependence on $d$
than any other previous analyses (both disagreement-based [27, 26] and more aggressive ones [20, 17, 23]).
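
To make the size of this gap concrete, the following minimal sketch evaluates the two bounds quoted above, $(d + \log(1/\delta))/\epsilon$ for passive learning and $(d + \log(1/\delta) + \log\log(1/\epsilon))\log(1/\epsilon)$ for active learning. The function names are hypothetical, the hidden constants are assumed to be 1, and natural logarithms are used, so the numbers only indicate growth rates, not actual sample sizes.

```python
import math


def passive_labels(d: int, eps: float, delta: float) -> float:
    """(d + log(1/delta)) / eps, with the hidden constant assumed to be 1."""
    return (d + math.log(1.0 / delta)) / eps


def active_labels(d: int, eps: float, delta: float) -> float:
    """(d + log(1/delta) + log log(1/eps)) * log(1/eps), constant assumed 1."""
    log_inv_eps = math.log(1.0 / eps)
    return (d + math.log(1.0 / delta) + math.log(log_inv_eps)) * log_inv_eps


if __name__ == "__main__":
    # Illustrative parameters only: d = 10, delta = 0.05.
    for eps in (1e-1, 1e-2, 1e-3, 1e-4):
        p = passive_labels(10, eps, 0.05)
        a = active_labels(10, eps, 0.05)
        print(f"eps={eps:g}  passive~{p:.0f}  active~{a:.0f}  ratio~{p / a:.1f}")
```

The ratio between the two quantities grows roughly like $1/(\epsilon \log(1/\epsilon))$ as $\epsilon$ shrinks, which is the sense in which the improvement is exponential in $\log(1/\epsilon)$.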

Techniques At the core of our results is a novel characterization of the region of disagreement of two linear separators
under a log-concave measure. We show that for any two linear separators specified by normal vectors $u$ and $v$, for any
constant $c \in (0, 1)$ we can pick a margin as small as $\gamma = \Theta(\alpha)$, where $\alpha$ is the angle between $u$ and $v$, and still ensure
that the probability mass of the region of disagreement outside of the band of margin $\gamma$ of one of them is at most $c\alpha$ (Theorem 4).
Using this fact, we then show how we can use a margin-based active learning technique, where in each round we only
query points near the hypothesized decision boundary, to get an exponential improvement over passive learning.

We then show that any passive learning algorithm that outputs a hypothesis consistent with $O(d/\epsilon + (1/\epsilon) \log(1/\delta))$
random examples will, with probability at least $1 - \delta$, output a hypothesis of error at most $\epsilon$ (Theorem 6). Interestingly,
our analysis is quite dissimilar to the classic analyses of ERM. It proceeds by conceptually running the algorithm
online on progressively larger chunks of examples, and using the intermediate hypotheses to track the progress of
the algorithm. We show, using the same tools as in the active learning analysis, that it is always likely that the
algorithm will receive informative examples. Our analysis shows that the algorithm would also achieve $\epsilon$ accuracy
with high probability even if it periodically built preliminary hypotheses using some of the examples, and then only
used borderline cases for those preliminary classifiers for further training.² To achieve the optimal sample complexity,
we have to carefully distribute the confidence parameter, by allowing higher probability of failure in the later stages,
to compensate for the fact that, once the hypothesis is already pretty good, it takes longer to get examples that help to
further improve it.

Extensions We also study label-efficient learning in the presence of noise. We show how our results for the realizable 
case can be extended to handle Tsybakov noise, which has received substantial attention in statistical learning theory, 
both for passive and active learning [39, 8, 40, 21, 6, 33, 27]; this includes the random classification noise model
commonly studied in computational learning theory PP . as well as the more general bounded (or Massart) noise [8 

HU]|21][33]. 

Our analysis for Massart noise leads to optimal bounds (up to constant factors) for active and passive learning of 
linear separators when the marginal distribution on the feature vectors is log-concave, improving the dependence on d 
over previous best known results. Our analysis for Tsybakov noise leads to bounds on active learning with improved 
dependence on d over previous known results in this case as well. 

We also provide a bound on Alexander's capacity [1, 21] and the closely related disagreement coefficient
notion [26], which have been widely used to characterize the sample complexity of various algorithms [26, 19, 33, 21].
This immediately implies concrete bounds on the labeled data complexity of several algorithms in the literature,
including active learning algorithms designed for the purely agnostic case (i.e., arbitrary forms of noise), such as the $A^2$
algorithm [5] and the DHM algorithm [19].

We further extend our results both for passive and active learning to deal with nearly log-concave distributions; this 
is a broader class of distributions introduced by Applegate and Kannan [3], which contains mixtures of (not too
separated) log-concave distributions. In deriving our results, we provide new tail bounds and structural results for 

2 Note that such examples would not be i.i.d from the underlying distribution! 
these distributions, which might be of independent interest and utility, both in learning theory and more widely.

1.1 Additional Related Work 

$\epsilon$-nets, Learning, and Geometry Small $\epsilon$-nets are useful for many applications, especially in Computational
Geometry (see [42]). The same fundamental techniques of Vapnik and Chervonenkis [47, 48] have been applied to establish
the existence of small $\epsilon$-nets [30] and to bound the sample complexity of learning [48, 10], and a number of interesting
upper and lower bounds on the smallest possible size of $\epsilon$-nets have been obtained [34, 15, 2]. Our analysis implies an
$O(d/\epsilon)$ upper bound on the size of an $\epsilon$-net for a set of regions of disagreement between all possible linear classifiers
and the target, when the distribution is zero-mean and log-concave.

Alexander Capacity and the Disagreement Coefficient Roughly speaking, the Alexander capacity [1, 21] quantifies
how fast the region of disagreement of the set of classifiers at distance $r$ of the optimal classifier collapses as a function
of $r$; the disagreement coefficient [26] additionally involves the supremum of $r$ over a range of values. Friedman [24]
provides guarantees on these quantities (for sufficiently small $r$) for general classes of functions in $\mathbb{R}^d$ if the underlying
data distribution is sufficiently smooth. Our analysis implies much tighter bounds for linear separators under
log-concave distributions (matching what was known for the much less general case of nearly uniform distribution
over the unit sphere); furthermore, we also analyze the nearly log-concave case where we allow an arbitrary number of
discontinuities, a case not captured by the Friedman [24] conditions at all. This immediately implies concrete bounds
on the labeled data complexity of several algorithms in the literature including the $A^2$ algorithm [5] and the DHM
algorithm [19], with implications for the purely agnostic case (i.e., arbitrary forms of noise), as well as Koltchinskii's
algorithm [33] and the CAL algorithm [5, 26, 27]. Furthermore, in the realizable case and under Tsybakov noise, we
show even better bounds, by considering aggressive active learning algorithms.

2 Preliminaries and Notation 

We focus on binary classification problems; that is, we consider the problem of predicting a binary label $y$ based on its
corresponding input vector $x$. As in the standard machine learning formulation, we assume that the data points $(x, y)$
are drawn from an unknown underlying distribution $D_{XY}$ over $X \times Y$; $X$ is called the instance space and $Y$ is the
label space. In this paper we assume that $Y = \{\pm 1\}$ and $X = \mathbb{R}^d$; we also denote the marginal distribution over
$X$ by $D$. Let $C$ be the class of linear separators through the origin, that is $C = \{\mathrm{sign}(w \cdot x) : w \in \mathbb{R}^d, \|w\| = 1\}$.
To keep the notation simple, we sometimes refer to a weight vector and the linear classifier with that weight vector
interchangeably. Our goal is to output a hypothesis function $w \in C$ of small error, where
$\mathrm{err}(w) = \mathrm{err}_{D_{XY}}(w) = P_{(x,y) \sim D_{XY}}[\mathrm{sign}(w \cdot x) \neq y]$.

We consider two learning protocols: passive learning and active learning. In the passive learning setting, the learning
algorithm is given a set of labeled examples $(x_1, y_1), \ldots, (x_m, y_m)$ drawn i.i.d. from $D_{XY}$ and the goal is to output a
hypothesis of small error by using only a polynomial number of labeled examples. In the (pool-based) active learning
setting, a set of labeled examples $(x_1, y_1), \ldots, (x_m, y_m)$ is also drawn i.i.d. from $D_{XY}$; the learning algorithm is
permitted direct access to the sequence of $x_i$ values (unlabeled data points), but has to make a label request to obtain
the label $y_i$ of example $x_i$. The hope is that in the active learning setting we can output a classifier of small error by
using many fewer label requests than in the passive learning setting by actively directing the queries to informative
examples (while keeping the number of unlabeled examples polynomial). For added generality, we also consider the
selective sampling active learning model, where the algorithm visits the unlabeled data points $x_i$ in sequence, and for
each $i$, makes a decision on whether or not to request the label $y_i$ based only on the previously observed $x_j$ values
($j \leq i$) and corresponding requested labels, and never changes this decision once made. Both our upper and lower
bounds will apply to both selective sampling and pool-based active learning.

3 The region of disagreement DIS(C) of a set of classifiers $C$ is the set of instances $x$ such that there exist two classifiers
$f, g \in C$ that disagree about the label of $x$.
In the "realizable case", we assume that the labels are deterministic and generated by a target function that belongs to
$C$. In the non-realizable case (studied in Section 8) we do not make this assumption and instead aim to compete with
the best function in $C$.

Given two vectors $u$ and $v$ and any distribution $D$ we denote by $d_D(u, v) = P_{x \sim D}(\mathrm{sign}(u \cdot x) \neq \mathrm{sign}(v \cdot x))$; we also
denote by $\theta(u, v)$ the angle between the vectors $u$ and $v$.
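
As a small illustration of these two quantities, the sketch below estimates the disagreement probability of two homogeneous separators by Monte Carlo under a standard Gaussian, which is one particular isotropic log-concave distribution chosen here as an assumption. For spherically symmetric distributions the disagreement probability equals the angle divided by $\pi$, which gives a value to compare against. The helper names are hypothetical.

```python
import math
import random


def angle(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def disagreement(u, v, n, rng):
    """Monte Carlo estimate of P[sign(u.x) != sign(v.x)] for x ~ N(0, I)."""
    d = len(u)
    hits = 0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        side_u = sum(a * b for a, b in zip(u, x)) >= 0
        side_v = sum(a * b for a, b in zip(v, x)) >= 0
        hits += side_u != side_v
    return hits / n


if __name__ == "__main__":
    u = (1.0, 0.0, 0.0)
    v = (math.cos(0.3), math.sin(0.3), 0.0)
    est = disagreement(u, v, 20000, random.Random(1))
    print(f"estimate={est:.4f}  angle/pi={angle(u, v) / math.pi:.4f}")
```

For general log-concave distributions the two quantities are no longer proportional with a known constant; the example only shows how they are defined and compared.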

3 Log-Concave Densities

Throughout this paper we focus on the case where the underlying distribution $D$ is log-concave or nearly log-concave.
Such distributions have played a key role in the past two decades in several areas including sampling, optimization, and
integration algorithms [38], and more recently for learning theory as well [32, 50]. In this section we first summarize
known results about such distributions that are useful for our analysis and then prove a novel structural statement that
will be key to our analysis (Theorem 4). In Section 6 we describe extensions to nearly log-concave distributions as
well.

We begin with the definition: 

Definition 1 A distribution over $\mathbb{R}^d$ is log-concave if $\log f(\cdot)$ is concave, where $f$ is its associated density function. It
is isotropic if its mean is the origin and its covariance matrix is the identity.

Log-concave distributions form a broad class of distributions: for example, the Gaussian, Logistic, Exponential, and 
uniform distribution over any convex set are log-concave distributions. The following lemma summarizes known 
useful facts about isotropic log-concave distributions (most are from [38]; the upper bound on the density is from
[32]).
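
The definition can be probed numerically in one dimension. The sketch below runs a midpoint-concavity check of a log-density on a finite grid; passing it is only a necessary condition on that grid, not a proof of log-concavity. The Gaussian and Laplace densities are used as log-concave examples and the Cauchy density as a counterexample; these choices, the grid, and the function names are assumptions made for illustration.

```python
import math


def is_midpoint_log_concave(log_density, grid, tol=1e-12):
    """Check log f((a+b)/2) >= (log f(a) + log f(b)) / 2 for all grid pairs."""
    for a in grid:
        for b in grid:
            mid = 0.5 * (a + b)
            if log_density(mid) + tol < 0.5 * (log_density(a) + log_density(b)):
                return False
    return True


def log_gaussian(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def log_laplace(x):
    return -abs(x) - math.log(2.0)


def log_cauchy(x):
    # Heavy tails: the log-density is convex for |x| > 1, so the check fails.
    return -math.log(math.pi * (1.0 + x * x))


if __name__ == "__main__":
    grid = [i / 4.0 for i in range(-40, 41)]
    for name, fn in (("gaussian", log_gaussian), ("laplace", log_laplace),
                     ("cauchy", log_cauchy)):
        print(name, is_midpoint_log_concave(fn, grid))
```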

Lemma 2 Assume that $D$ is log-concave in $\mathbb{R}^d$ and let $f$ be its density function.

(a) If $D$ is isotropic then $P_{x \sim D}[\|x\| \geq \alpha\sqrt{d}] \leq e^{-\alpha+1}$. Moreover, if $d = 1$, we have: $P_{x \sim D}[x \in [a, b]] \leq |b - a|$.

(b) If $D$ is isotropic, then $f(x) \geq 2^{-7d} 2^{-9d\|x\|}$ whenever $0 \leq \|x\| \leq 1/9$. Furthermore, $2^{-7d} \leq f(0) \leq d(20d)^{d/2}$,
and $f(x) \leq A(d) \exp(-B(d)\|x\|)$, where $A(d)$ is $2^{8d} d^{d/2} e$ and $B(d)$ is $\frac{2^{-7d}}{2(d-1)(20(d-1))^{(d-1)/2}}$, for all $x$ of any
norm.

(c) All marginals of $D$ are log-concave. If $D$ is isotropic, its marginals are isotropic as well.

(d) If $E[\|X\|^2] = C^2$, then $P[\|X\| > RC] < e^{-R+1}$.

(e) If $D$ is isotropic and $d = 1$ we have $f(0) \geq 1/8$ and $f(x) \leq 1$ for all $x$.
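
As a sanity check on the one-dimensional statements, the sketch below evaluates the tail bound $P[|X| \geq \alpha] \leq e^{-\alpha+1}$ and the density bounds $f(0) \geq 1/8$, $f(x) \leq 1$ for three zero-mean, unit-variance log-concave distributions whose tails and densities have closed forms: the standard Gaussian, the Laplace distribution with scale $1/\sqrt{2}$, and the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$. These three examples and all names are assumptions chosen for illustration, and checking them does not prove the lemma.

```python
import math

SQRT2 = math.sqrt(2.0)
SQRT3 = math.sqrt(3.0)

# Two-sided tails P[|X| >= alpha] for unit-variance, zero-mean distributions.
TAILS = {
    "gaussian": lambda alpha: math.erfc(alpha / SQRT2),
    "laplace": lambda alpha: math.exp(-SQRT2 * alpha),
    "uniform": lambda alpha: max(0.0, 1.0 - alpha / SQRT3),
}

# Density at the origin; for all three examples this is also the maximum.
DENSITY_AT_ZERO = {
    "gaussian": 1.0 / math.sqrt(2.0 * math.pi),
    "laplace": 1.0 / SQRT2,
    "uniform": 1.0 / (2.0 * SQRT3),
}


def tail_bound_holds(name, alphas):
    """True if P[|X| >= alpha] <= exp(-alpha + 1) at every listed alpha."""
    return all(TAILS[name](a) <= math.exp(-a + 1.0) for a in alphas)


def density_bounds_hold(name):
    """True if 1/8 <= f(0) and the maximum density is at most 1."""
    return 1.0 / 8.0 <= DENSITY_AT_ZERO[name] <= 1.0


if __name__ == "__main__":
    alphas = [i / 10.0 for i in range(0, 101)]
    for name in TAILS:
        print(name, tail_bound_holds(name, alphas), density_bounds_hold(name))
```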

Throughout our paper we will use the fact that there exists a universal constant $c$ such that the probability of
disagreement of any two homogeneous linear separators is lower bounded by $c$ times the angle between their normal
vectors. This follows by projecting the region of disagreement in the space given by the two normal vectors, and then
using properties of log-concave distributions in 2-dimensions. The proof is implicit in earlier works (e.g., Vempala
[50]); for completeness, we include a proof here.

Lemma 3 Assume $D$ is an isotropic log-concave distribution in $\mathbb{R}^d$. Then there exists $c$ such that for any two unit vectors $u$ and $v$
in $\mathbb{R}^d$ we have $c\,\theta(v, u) \leq d_D(u, v)$.

Proof: Consider two unit vectors $u$ and $v$. Let $\mathrm{proj}_{u,v}(x)$ denote the projection operator that, given $x \in \mathbb{R}^d$, orthogonally
projects $x$ onto the plane determined by $u$ and $v$. That is, if we define an orthogonal coordinate system in which
coordinates 1, 2 lie in this plane and coordinates $3, \ldots, d$ are orthogonal to this plane, then $x' = \mathrm{proj}_{u,v}(x_1, \ldots, x_d) =
(x_1, x_2)$. Also, given distribution $D$ over $\mathbb{R}^d$, define $\mathrm{proj}_{u,v}(D)$ to be the distribution given by first picking $x \sim D$
and then outputting $x' = \mathrm{proj}_{u,v}(x)$. That is, $\mathrm{proj}_{u,v}(D)$ is just the marginal distribution over coordinates 1, 2
in the above coordinate system. Notice that if $x' = \mathrm{proj}_{u,v}(x)$ then $u \cdot x = u' \cdot x'$ and $v \cdot x = v' \cdot x'$, where $u' = \mathrm{proj}_{u,v}(u)$ and
$v' = \mathrm{proj}_{u,v}(v)$. So, if $D_2 = \mathrm{proj}_{u,v}(D)$ then $d_D(u, v) = d_{D_2}(u', v')$.

By Lemma 2(c), we have that if $D$ is isotropic and log-concave, then $D_2$ is as well. Let $A$ be the region of
disagreement between $u'$ and $v'$ intersected with the ball of radius 1/9 in $\mathbb{R}^2$. The probability mass of $A$ under $D_2$ is
at least the volume of $A$ times $\inf_{x \in A} D_2(x)$. So, using Lemma 2(b),

$$d_{D_2}(u', v') \geq \mathrm{vol}(A) \inf_{x \in A} D_2(x) \geq c\,\theta(u, v).$$



To analyze our active and passive learning algorithms we provide a novel characterization of the region of disagreement 
of two linear separators under a log-concave measure: 

Theorem 4 For any $c_1 > 0$, there is a $c_2 > 0$ such that the following holds. Let $u$ and $v$ be two unit vectors in $\mathbb{R}^d$,
and assume that $\theta(u, v) = \alpha < \pi/2$. Assume that $D$ is isotropic log-concave in $\mathbb{R}^d$. Then

$$P_{x \sim D}[\mathrm{sign}(u \cdot x) \neq \mathrm{sign}(v \cdot x) \text{ and } |v \cdot x| \geq c_2 \alpha] \leq c_1 \alpha. \quad (4)$$

Proof: Choose $c_1, c_2 > 0$. We will show that, if $c_2$ is large enough relative to $1/c_1$, then (4) holds. Let $b = c_2 \alpha$. Let
$E$ be the set whose probability we want to bound.

Arguing as in the proof of Lemma 3, we may assume without loss of generality that $d = 2$.

Next, we claim that each member $x$ of $E$ has $\|x\| \geq b/\alpha = c_2$. Assume without loss of generality that $v \cdot x$ is
positive. (The other case is symmetric.) Then $u \cdot x < 0$, so the angle of $x$ with $u$ is obtuse, i.e. $\theta(x, u) \geq \pi/2$. Since
$\theta(u, v) = \alpha$, this implies that

$$\theta(x, v) \geq \pi/2 - \alpha. \quad (5)$$

But $x \cdot v \geq b$, and $v$ is unit length, so $\|x\| \cos\theta(x, v) \geq b$, which, using (5), implies $\|x\| \cos(\pi/2 - \alpha) \geq b$, which,
since $\cos(\pi/2 - \alpha) \leq \alpha$ for all $\alpha \in [0, \pi/2]$, in turn implies $\|x\| \geq b/\alpha = c_2$. This implies that, if $B(r)$ is a ball of
radius $r$ in $\mathbb{R}^2$, that

$$P[E] = \sum_{i=1}^{\infty} P[E \cap (B((i+1)c_2) - B(i c_2))]. \quad (6)$$

To obtain the desired bound, we carefully bound each term in the RHS. Choose $i \geq 1$.
Let $f(x_1, x_2)$ be the density of $D$. We have

$$P[E \cap (B((i+1)c_2) - B(i c_2))] = \int_{(x_1, x_2) \in B((i+1)c_2) - B(i c_2)} 1_E(x_1, x_2) f(x_1, x_2) \, dx_1 dx_2.$$

Applying the density upper bound from Lemma 2 with $d = 2$, there are constants $C_1$ and $C_2$ such that

$$P[E \cap (B((i+1)c_2) - B(i c_2))] \leq \int_{(x_1, x_2) \in B((i+1)c_2) - B(i c_2)} 1_E(x_1, x_2) C_1 \exp(-c_2 C_2 i) \, dx_1 dx_2$$
$$= C_1 \exp(-c_2 C_2 i) \int_{(x_1, x_2) \in B((i+1)c_2) - B(i c_2)} 1_E(x_1, x_2) \, dx_1 dx_2.$$

If we include $B(i c_2)$ in the integral again, we get

$$P[E \cap (B((i+1)c_2) - B(i c_2))] \leq C_1 \exp(-c_2 C_2 i) \int_{(x_1, x_2) \in B((i+1)c_2)} 1_E(x_1, x_2) \, dx_1 dx_2.$$
Now, we exploit the fact that the integral above is a rescaling of a probability with respect to the uniform distribution.
Let $C_3$ be the volume of the unit ball in $\mathbb{R}^2$. Then, we have

$$P[E \cap (B((i+1)c_2) - B(i c_2))] \leq C_1 \exp(-c_2 C_2 i) \, C_3 (i+1)^2 c_2^2 \alpha / \pi = C_4 c_2^2 \alpha (i+1)^2 \exp(-c_2 C_2 i),$$

for $C_4 = C_1 C_3 / \pi$. Returning to (6), we get

$$P[E] \leq \sum_{i=1}^{\infty} C_4 c_2^2 \alpha (i+1)^2 \exp(-c_2 C_2 i) = C_4 c_2^2 \times \frac{4 e^{2 c_2 C_2} - 3 e^{c_2 C_2} + 1}{(e^{c_2 C_2} - 1)^3} \times \alpha.$$

Since

$$\lim_{c_2 \to \infty} c_2^2 \times \frac{4 e^{2 c_2 C_2} - 3 e^{c_2 C_2} + 1}{(e^{c_2 C_2} - 1)^3} = 0,$$

this completes the proof.
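
The closed form used in the last step of the proof is the identity $\sum_{i \geq 1} (i+1)^2 e^{-ai} = (4e^{2a} - 3e^{a} + 1)/(e^{a} - 1)^3$ with $a = c_2 C_2$. The sketch below compares a truncated sum with the closed form and shows that $a^2$ times the closed form becomes negligible for large $a$. The function names and the truncation length are assumptions for illustration.

```python
import math


def series_partial(a: float, terms: int = 2000) -> float:
    """Truncated sum of (i + 1)^2 * exp(-a * i) for i = 1..terms."""
    return sum((i + 1) ** 2 * math.exp(-a * i) for i in range(1, terms + 1))


def series_closed(a: float) -> float:
    """Closed form (4 e^{2a} - 3 e^{a} + 1) / (e^{a} - 1)^3, valid for a > 0."""
    e = math.exp(a)
    return (4.0 * e * e - 3.0 * e + 1.0) / (e - 1.0) ** 3


if __name__ == "__main__":
    for a in (0.5, 1.0, 2.0, 5.0, 20.0):
        print(f"a={a:g}  partial={series_partial(a):.6g}  "
              f"closed={series_closed(a):.6g}  a^2*closed={a * a * series_closed(a):.3g}")
```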



We note that a weaker result of this type was proven (via different techniques) for the uniform distribution in the unit
ball in [6]. In addition to being more general, Theorem 4 is tighter and more refined even for this specific case; this
improvement is essential for obtaining tight bounds for polynomial time algorithms for passive learning (Section 5)
and better bounds for active learning as well. 
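
The statement of Theorem 4 can be explored empirically. The sketch below assumes a standard Gaussian in the plane (one isotropic log-concave distribution) and estimates, for several candidate band constants, the probability that $u$ and $v$ disagree on $x$ while $|v \cdot x|$ is at least that constant times the angle. All names and parameter values are hypothetical, and the simulation illustrates the trend rather than verifying the theorem.

```python
import math
import random


def outside_band_mass(alpha, band_constants, n, rng):
    """Estimate P[sign(u.x) != sign(v.x) and |v.x| >= c * alpha] for each c.

    u and v are unit vectors in the plane at angle alpha; x ~ N(0, I_2).
    The same sample is reused for every c, so the estimates are nested.
    """
    v = (math.cos(alpha), math.sin(alpha))  # u is fixed to (1, 0)
    counts = [0] * len(band_constants)
    for _ in range(n):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        ux = x[0]
        vx = v[0] * x[0] + v[1] * x[1]
        if (ux >= 0) != (vx >= 0):
            for j, c in enumerate(band_constants):
                if abs(vx) >= c * alpha:
                    counts[j] += 1
    return [count / n for count in counts]


if __name__ == "__main__":
    alpha = 0.2
    constants = [0.0, 1.0, 2.0, 4.0]
    masses = outside_band_mass(alpha, constants, 50000, random.Random(0))
    for c, mass in zip(constants, masses):
        print(f"c2={c:g}  mass/alpha={mass / alpha:.4f}")
```

With the band constant set to 0 the estimate is the full disagreement mass, about $\alpha/\pi$ for this distribution; increasing the constant drives the ratio of mass to $\alpha$ toward zero, which is the behaviour the theorem quantifies.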



4 Active Learning 

In this section we analyze a margin-based algorithm for actively learning linear separators under log-concave
distributions [6] (Algorithm 1). Lower bounds proved in Section 7 show that this algorithm needs exponentially fewer labeled
examples than any passive learning algorithm.

Algorithm 1 is somewhat like the ellipsoid algorithm, except (a) it maintains a ball containing the target, instead of
an ellipse, (b) it updates this ball only after adding multiple constraints, instead of just one, and (c) it maintains the
target inside of the ball only with high probability, instead of certainly. Picking multiple random constraints allows the
algorithm to ensure that any new hypothesis satisfying them is closer to the target than the old one, and therefore can
be enclosed in a smaller ball. This algorithm has been previously proposed and analyzed in [6] for the special case of
the uniform distribution in the unit ball. In this paper we analyze it for the much more general class of log-concave
distributions.

Theorem 5 Assume $D$ is isotropic log-concave in $\mathbb{R}^d$. There exist constants $C_1, C_2$ s.t. for $d \geq 4$, and for any
$\epsilon, \delta > 0$, $\epsilon < 1/4$, using Algorithm 1 with $b_k = \frac{C_1}{2^k}$ and $m_k = C_2 \left(d + \ln\frac{1+s-k}{\delta}\right)$, after $s = \lceil \log_2 \frac{1}{c\epsilon} \rceil$ iterations,
we find a separator of error at most $\epsilon$ with probability $1 - \delta$. The total number of labeled examples needed is
$O((d + \log(1/\delta) + \log\log(1/\epsilon)) \log(1/\epsilon))$.
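
To see how the parameters of the theorem fit together, the sketch below tabulates a per-round schedule with cut-offs $b_k = C_1/2^k$ and sample sizes $m_k = C_2(d + \ln((1+s-k)/\delta))$, taking the number of rounds $s$ as an input. The values $C_1 = C_2 = 1$ and the function name are assumptions; the theorem only asserts that suitable constants exist.

```python
import math


def schedule(d: int, delta: float, s: int, c1: float = 1.0, c2: float = 1.0):
    """Return [(k, b_k, m_k)] for k = 1..s.

    b_k = c1 / 2^k is the band half-width and
    m_k = ceil(c2 * (d + ln((1 + s - k) / delta))) the number of label requests.
    """
    rounds = []
    for k in range(1, s + 1):
        b_k = c1 / 2 ** k
        m_k = math.ceil(c2 * (d + math.log((1 + s - k) / delta)))
        rounds.append((k, b_k, m_k))
    return rounds


if __name__ == "__main__":
    plan = schedule(d=10, delta=0.05, s=8)
    for k, b_k, m_k in plan:
        print(f"round {k}: b_k={b_k:.5f}  m_k={m_k}")
    print("total label requests:", sum(m for _, _, m in plan))
```

The band shrinks geometrically while the per-round label budget stays within $d + \ln(s/\delta) + 1$, so with $s = O(\log(1/\epsilon))$ rounds the total matches the $O((d + \log(1/\delta) + \log\log(1/\epsilon)) \log(1/\epsilon))$ form.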

Proof: Let $c$ be the constant from Lemma 3. We will show, using induction, that, for all $k \leq s$, with probability at
least $1 - \frac{\delta}{2} \sum_{i<k} \frac{1}{(1+s-i)^2}$, any $\hat{w}$ consistent with the data in the working set $W(k)$ has $\mathrm{err}(\hat{w}) \leq c 2^{-k}$, so that, in
particular, $\mathrm{err}(w_k) \leq c 2^{-k}$.

The case where $k = 1$ follows from the standard VC bounds (see e.g., [47]). Assume now the claim is true for $k - 1$
($k > 1$), and consider the $k$th iteration. Let

$S_1 = \{x : |w_{k-1} \cdot x| \leq b_{k-1}\}$, and $S_2 = \{x : |w_{k-1} \cdot x| > b_{k-1}\}$.
Algorithm 1 Margin-based Active Learning

Input: a sampling oracle for $D$, a labeling oracle, sequences $m_k > 0$, $k \in Z_+$ (sample sizes) and $b_k > 0$, $k \in Z_+$
(cut-off values).

Output: weight vector $w_s$.

• Draw $m_1$ examples from $D$, label them and put them in $W(1)$.

• iterate $k = 1, \ldots, s$

- find a hypothesis $w_k$ with $\|w_k\|_2 = 1$ consistent with all labeled examples in $W(k)$.

- let $W(k+1) = W(k)$.

- until $m_{k+1}$ additional data points are labeled, draw sample $x$ from $D$

* if $|w_k \cdot x| \geq b_k$, then reject $x$,

* else, ask for label of $x$, and put into $W(k+1)$.
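
The loop of Algorithm 1 can be sketched in a few lines. The version below makes several simplifying assumptions that are not part of the paper: it works in the plane ($d = 2$, whereas Theorem 5 is stated for $d \geq 4$), draws from a standard Gaussian, uses the same number `m` of label requests in every round instead of the sequence $m_k$, and finds a consistent hypothesis with a hypothetical helper `consistent_2d` that bisects the cone spanned by the vectors $y x$ (in higher dimensions one would use linear programming instead).

```python
import math
import random


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def angle_between(u, v):
    """Angle in radians between two nonzero vectors in the plane."""
    c = dot(u, v) / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, c)))


def consistent_2d(labeled):
    """Unit vector w with y * (w . x) > 0 for all (x, y); plane only.

    For data separable through the origin, the vectors z = y * x span a cone
    narrower than pi, and the bisector of that cone is consistent.
    """
    zs = [(y * x[0], y * x[1]) for x, y in labeled]
    # Sum of normalized z's: a reference direction inside the cone.
    rx = sum(z[0] / math.hypot(*z) for z in zs)
    ry = sum(z[1] / math.hypot(*z) for z in zs)
    ref = math.atan2(ry, rx)
    # Signed angle of each z relative to the reference direction.
    rel = [math.atan2(rx * z[1] - ry * z[0], rx * z[0] + ry * z[1]) for z in zs]
    mid = ref + (min(rel) + max(rel)) / 2.0
    return (math.cos(mid), math.sin(mid))


def margin_based_active_learning(w_star, s, m1, m, c1, rng):
    """Run s >= 1 rounds with cut-offs b_k = c1 / 2^k; return (w_s, working set)."""
    def draw():  # sampling oracle: standard Gaussian in the plane
        return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

    def label(x):  # labeling oracle for the realizable case
        return 1 if dot(w_star, x) >= 0 else -1

    working = []
    for _ in range(m1):
        x = draw()
        working.append((x, label(x)))
    for k in range(1, s + 1):
        w_k = consistent_2d(working)
        if k == s:
            return w_k, working
        b_k = c1 / 2 ** k
        requested = 0
        while requested < m:
            x = draw()
            if abs(dot(w_k, x)) >= b_k:
                continue  # outside the band: reject without a label request
            working.append((x, label(x)))
            requested += 1


if __name__ == "__main__":
    target = (math.cos(1.0), math.sin(1.0))
    w, data = margin_based_active_learning(target, 6, 100, 30, 1.0, random.Random(0))
    print("labels used:", len(data), " angle to target:", angle_between(w, target))
```

Only points that fall inside the current band consume a label, which is where the savings over labeling every drawn example come from.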



By the induction hypothesis, we know that, with probability at least $1 - \frac{\delta}{2} \sum_{i<k-1} \frac{1}{(1+s-i)^2}$, all $\hat{w}$ consistent with
$W(k-1)$, including $w_{k-1}$, have errors at most $c 2^{-(k-1)}$. Consider an arbitrary such $\hat{w}$. By Lemma 3 we have
$\theta(\hat{w}, w^*) \leq 2^{-(k-1)}$ and $\theta(w_{k-1}, w^*) \leq 2^{-(k-1)}$, so $\theta(w_{k-1}, \hat{w}) \leq 4 \times 2^{-k}$. Applying Theorem 4, there is a choice
of $C_1$ (the constant such that $b_{k-1} = C_1 / 2^{k-1}$) that satisfies

$$P((w_{k-1} \cdot x)(\hat{w} \cdot x) < 0, \; x \in S_2) \leq \frac{c 2^{-k}}{4}$$

and

$$P((w_{k-1} \cdot x)(w^* \cdot x) < 0, \; x \in S_2) \leq \frac{c 2^{-k}}{4}.$$

So

$$P((\hat{w} \cdot x)(w^* \cdot x) < 0, \; x \in S_2) \leq \frac{c 2^{-k}}{2}. \quad (7)$$

Now let us treat the case that $x \in S_1$. Since we are labeling $m_k$ data points in $S_1$ at iteration $k - 1$, classic Vapnik-Chervonenkis
bounds [47] imply that, if $C_2$ is a large enough absolute constant, then with probability $1 - \delta/(4(1 + s - k)^2)$,
for all $\hat{w}$ consistent with the data in $W(k)$,

$$\mathrm{err}(\hat{w} \mid S_1) = P((\hat{w} \cdot x)(w^* \cdot x) < 0 \mid x \in S_1) \leq \frac{c 2^{-k}}{4 b_k} = \frac{c}{4 C_1}. \quad (8)$$

Finally, since $S_1$ consists of those points that, after projecting onto the direction $w_{k-1}$, fall into an interval of length
$2 b_k$, Lemma 2 implies that $P(S_1) \leq 2 b_k$. Putting this together with (7) and (8), with probability $1 - \frac{\delta}{2} \sum_{i<k} \frac{1}{(1+s-i)^2}$,
we have $\mathrm{err}(\hat{w}) \leq c 2^{-k}$, completing the proof.



5 Passive Learning 

In this section we show how an analysis that was inspired by active learning leads to optimal (up to constant factors) 
bounds for polynomial-time algorithms for passive learning. 

Theorem 6 Assume that $D$ is zero mean and log-concave in $\mathbb{R}^d$. There exists a constant $C_3$ s.t. for $d \geq 4$, and for any
$\epsilon, \delta > 0$, $\epsilon < 1/4$, any algorithm that outputs a hypothesis that correctly classifies $m = \frac{C_3 (d + \log(1/\delta))}{\epsilon}$ examples finds
a separator of error at most $\epsilon$ with probability at least $1 - \delta$.
Proof: First, let us prove the theorem in the case that $D$ is isotropic. We will then treat the general case at the end of
the proof.

While our analysis will ultimately provide a guarantee for any learning algorithm that always outputs a consistent
hypothesis, we will use the intermediate hypotheses of Algorithm 1 in the analysis.

Let $c$ be the constant from Lemma 3. While proving Theorem 5, we proved that, if Algorithm 1 is run with $b_k = \frac{C_1}{2^k}$
and $m_k = C_2 \left(d + \ln\frac{1+s-k}{\delta}\right)$, then for all $k \leq s$, with probability at least $1 - \frac{\delta}{2} \sum_{i<k} \frac{1}{(1+s-i)^2}$, any $\hat{w}$ consistent with the
data in $W(k)$ has $\mathrm{err}(\hat{w}) \leq c 2^{-k}$. Thus, after $s = O(\log(1/\epsilon))$ iterations, with probability at least $1 - \delta$, any linear
classifier consistent with all the training data has error at most $\epsilon$, since any such classifier is consistent with the examples in
$W(s)$.

Now, let us analyze the number of examples used, including those examples whose labels were not requested by
Algorithm 1. Lemma 2 implies that there is a positive constant $c_1$ such that $P(S_1) \geq c_1 b_k$: again, $S_1$ consists of those
points that fall into an interval of length $2 b_k$ after projecting onto $w_{k-1}$. The density is lower bounded by a constant
when $b_k \leq 1/9$, and we can use the bound for $1/9$ when $b_k > 1/9$.

The expected number of examples that we need before we find $m_k$ elements of $S_1$ is therefore at most $\frac{m_k}{c_1 b_k}$. Using
a Chernoff bound, if we draw $\frac{2 m_k}{c_1 b_k}$ examples, the probability that we fail to get $m_k$ members of $S_1$ is at most
$\exp(-m_k/6)$, which is at most $\delta/(4(1 + s - k)^2)$ if $C_2$ is large enough. So, the total number of examples needed,
$\sum_k \frac{2 m_k}{c_1 b_k}$, is at most a constant factor more than

$$\sum_{k=1}^{s} 2^k \left(d + \log\frac{1+s-k}{\delta}\right) = O(2^s (d + \log(1/\delta))) + \sum_{k=1}^{s} 2^k \log(1 + s - k)$$
$$= O\left(\frac{d + \log(1/\delta)}{\epsilon}\right) + \sum_{k=1}^{s} 2^k \log(1 + s - k).$$

We claim that $\sum_{k=1}^{s} 2^k\log(1+s-k) = O(1/\epsilon)$. We have

\[ \sum_{k=1}^{s} 2^k\log(1+s-k) \le \sum_{k=1}^{s} 2^k(3+s-k) \le \int_{k=1}^{s+1} 2^k(3+s-k)\,dk \]

(since $2^k(3+s-k)$ is increasing for $k \le s+1$)

\[ = \frac{2(2^s-1)(1+\ln 4) - 2s\ln 2}{\ln^2 2} = O(2^s) = O(1/\epsilon), \]

completing the proof in the case that D is isotropic. 
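The claim that $\sum_{k=1}^{s} 2^k\log(1+s-k)$ grows no faster than $2^s$ can be sanity-checked numerically. The following is a minimal sketch, not part of the proof; the function names are illustrative, and natural logarithms are assumed (a different base only changes the constant).

```python
import math


def weighted_log_sum(s: int) -> float:
    """Return sum_{k=1}^{s} 2^k * ln(1 + s - k)."""
    return sum(2.0 ** k * math.log(1 + s - k) for k in range(1, s + 1))


def ratio_to_two_power_s(s: int) -> float:
    """Ratio of the sum to 2^s; the claim says this stays bounded as s grows."""
    return weighted_log_sum(s) / 2.0 ** s


if __name__ == "__main__":
    for s in (1, 5, 10, 20, 40):
        print(s, round(ratio_to_two_power_s(s), 4))
```

Substituting $j = s - k$ shows why the ratio is bounded: it equals $\sum_{j=0}^{s-1} 2^{-j}\ln(1+j)$, a partial sum of a convergent series.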

Now let us treat the case in which $D$ is not isotropic. Suppose that $\Sigma$ is the covariance matrix of $D$, so that $\Sigma^{-1/2}$
is the "whitening transform". Suppose, for $m = \frac{c_3(d + \log(1/\delta))}{\epsilon}$, an algorithm is given a sample $S$ of examples
$(x_1, y_1), \ldots, (x_m, y_m)$, for $x_1, \ldots, x_m$ drawn according to $D$ and $y_1, \ldots, y_m$ labeled by a target hypothesis with weight vector
$v$. Note that $w$ is consistent with $S$ if and only if $w^T\Sigma^{1/2}$ is consistent with $(\Sigma^{-1/2}x_1, y_1), \ldots, (\Sigma^{-1/2}x_m, y_m)$ (so
those examples are consistent with $v^T\Sigma^{1/2}$). So our analysis of the isotropic case implies that, with probability $1 - \delta$,
for any $w$ consistent with $(x_1, y_1), \ldots, (x_m, y_m)$, we have

\[ P\left(\mathrm{sign}((v^T\Sigma^{1/2})(\Sigma^{-1/2}x)) \ne \mathrm{sign}((w^T\Sigma^{1/2})(\Sigma^{-1/2}x))\right) \le \epsilon, \]

which of course means that $P(\mathrm{sign}(v^Tx) \ne \mathrm{sign}(w^Tx)) \le \epsilon$. ∎
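The reduction above rests on the identity $(\Sigma^{1/2}w) \cdot (\Sigma^{-1/2}x) = w \cdot x$, which holds because $\Sigma^{1/2}$ is symmetric. The sketch below checks it numerically in two dimensions. The matrix `ROOT` is a hypothetical stand-in for $\Sigma^{1/2}$ (any symmetric positive definite matrix would do), and all helper names are illustrative.

```python
import random


def matvec(a, x):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1]]


def inverse_2x2(a):
    """Invert a 2x2 matrix with nonzero determinant."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


# Assumed example: a symmetric positive definite matrix playing the role of Sigma^{1/2}.
ROOT = [[2.0, 0.5], [0.5, 1.0]]
ROOT_INV = inverse_2x2(ROOT)  # plays the role of Sigma^{-1/2}


def max_inner_product_gap(trials=1000, seed=0):
    """Largest |w.x - (ROOT w).(ROOT_INV x)| over random Gaussian pairs (w, x)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        gap = abs(dot(w, x) - dot(matvec(ROOT, w), matvec(ROOT_INV, x)))
        worst = max(worst, gap)
    return worst
```

Since the inner products agree up to rounding, a hypothesis is consistent with the original sample exactly when its transformed version is consistent with the whitened sample.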

We conclude this section by pointing out several implications and connections of Theorem 6 and its proof.
(1) The separator in Theorem 6 can be found in polynomial time, for example by using linear programming.

(2) The analysis of Theorem 6 also bounds the number of unlabeled examples needed by the active learning algorithm
of Theorem 5. This shows that an algorithm can request a nearly optimally small number of labels without
increasing the total number of examples required by more than a constant factor.

(3) Since we prove that any hypothesis consistent with the training data has error rate at most $\epsilon$ with probability
$1 - \delta$, setting $\delta$ to a constant gives a proof of an $O(d/\epsilon)$ bound on the size of an $\epsilon$-net for the following set:
$\{\{x : (w \cdot x)(w^* \cdot x) < 0\} : w \in \mathbb{R}^n\}$.
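Remark (1) concerns finding a hypothesis consistent with the sample. As a dependency-free illustration, the sketch below uses the classical perceptron rather than linear programming. This is a swapped-in technique: it finds a consistent homogeneous separator on separable data, but unlike an LP solver it carries no polynomial-time guarantee in general. The data generator, the margin filter, and all names are assumptions made for the demo.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_separable_sample(n, target, margin, seed):
    """Draw Gaussian points labeled by sign(target . x), keeping only points
    whose score has magnitude at least `margin` (hypothetical helper; the
    margin filter only keeps the demo fast)."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        x = [rng.gauss(0.0, 1.0) for _ in target]
        score = dot(target, x)
        if abs(score) >= margin:
            sample.append((x, 1 if score > 0 else -1))
    return sample


def consistent_separator(sample, max_epochs=10_000):
    """Perceptron: return w with y * (w . x) > 0 for every (x, y) in sample."""
    w = [0.0] * len(sample[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in sample:
            if y * dot(w, x) <= 0:
                # Standard perceptron update on a misclassified example.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w
    raise RuntimeError("no consistent separator found within max_epochs")


TARGET = [0.6, -0.8, 0.0]  # assumed unit-length target weight vector
SAMPLE = make_separable_sample(200, TARGET, margin=0.1, seed=0)
W_HAT = consistent_separator(SAMPLE)
```

By Theorem 6, any such consistent hypothesis inherits the stated error guarantee once the sample is large enough; the method used to find it does not matter for the bound.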

6 More Distributions 

In this section we consider learning with respect to a more general class of distributions, first analyzing active learning
using a relaxation of the assumption of log-concavity in a manner considered previously in [3] and [13], then removing
the assumption that the distribution is isotropic.

We start by laying some groundwork establishing conditions on a set $\mathcal{D}$ of distributions that imply that efficient
active learning w.r.t. distributions in $\mathcal{D}$ is possible. Our later proofs will proceed by showing that different classes of
distributions satisfy these conditions. The first step is to give a name to the key properties used in our analysis.

Definition 7 A set $\mathcal{D}$ of distributions is admissible if it satisfies the following:

• There exists $c$ such that for any $D \in \mathcal{D}$ and any two unit vectors $u$ and $v$ in $\mathbb{R}^d$ we have $c\theta(v, u) \le d_D(u, v)$.

• For any $c_1 > 0$, there is a $c_2 > 0$ such that the following holds for all $D \in \mathcal{D}$. Let $u$ and $v$ be two unit vectors in
$\mathbb{R}^d$, and assume that $\theta(u, v) = \alpha < \pi/2$. Then

\[ P_{x \sim D}[\mathrm{sign}(u \cdot x) \ne \mathrm{sign}(v \cdot x),\ |v \cdot x| \ge c_2\alpha] \le c_1\alpha. \]

• There are positive constants $c_3, c_4, c_5$ such that, for any $D' \in \mathcal{D}$, for any projection $D$ of $D'$ onto a one-dimensional
subspace, the density $f$ of $D$ satisfies

  - $f(x) < c_3$ for all $x$,

  - $f(x) > c_4$ for all $x$ with $|x| < c_5$.

Theorem 8 If $\mathcal{D}$ is admissible, then arbitrary $f \in C$ can be learned with respect to arbitrary distributions in $\mathcal{D}$ in
polynomial time in the active learning model from $O((d + \log(1/\delta) + \log\log(1/\epsilon))\log(1/\epsilon))$ labeled examples, and
in the passive learning model from $O\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$ examples.

Proof: The proofs of Theorem 5 and Theorem 6 can be used without modification. ∎



6.1 The nearly log-concave case 

In this section, we consider a generalization that relaxes the constraint of log-concavity, considered previously in [3]
and [13].

Definition 9 A density function $f : \mathbb{R}^n \to \mathbb{R}$ is $\beta$ log-concave if for any $\lambda \in [0, 1]$, $x_1 \in \mathbb{R}^n$, $x_2 \in \mathbb{R}^n$, we have

\[ f(\lambda x_1 + (1-\lambda)x_2) \ge e^{-\beta} f(x_1)^{\lambda} f(x_2)^{1-\lambda}. \]

Clearly, a density function $f$ is log-concave if it is 0-log-concave. An example of an $O(1)$-log-concave distribution
is a mixture of two log-concave distributions whose covariance matrices are $I$, and whose means $\mu_1$ and $\mu_2$ have
$\|\mu_1 - \mu_2\| = O(1)$. Our main result about $\beta$-log-concave distributions is the following.
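To make Definition 9 concrete, the following sketch estimates the smallest admissible $\beta$ for a one-dimensional equal-weight mixture of two unit-variance Gaussians centered at $\pm\mu$, by brute force over a grid. The grid search, its parameters, and the function names are assumptions chosen for illustration; the estimate is only a lower bound on the true $\beta$, since it checks finitely many triples $(x_1, x_2, \lambda)$.

```python
import math


def log_gaussian_mixture(x: float, mu: float) -> float:
    """ln of the density of 0.5*N(-mu, 1) + 0.5*N(mu, 1), computed stably."""
    a = -0.5 * (x - mu) ** 2
    b = -0.5 * (x + mu) ** 2
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m)) - 0.5 * math.log(2 * math.pi)


def estimate_beta(log_density, lo=-4.0, hi=4.0, steps=80, lambdas=9) -> float:
    """Largest observed value of
    lam*h(x1) + (1-lam)*h(x2) - h(lam*x1 + (1-lam)*x2), with h = ln f,
    over a grid; f is beta log-concave only if beta is at least this large."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    hs = [log_density(x) for x in xs]
    worst = 0.0
    for i, x1 in enumerate(xs):
        for j, x2 in enumerate(xs):
            for t in range(1, lambdas + 1):
                lam = t / (lambdas + 1)
                mid = lam * x1 + (1 - lam) * x2
                gap = lam * hs[i] + (1 - lam) * hs[j] - log_density(mid)
                worst = max(worst, gap)
    return worst


BETA_SINGLE = estimate_beta(lambda x: log_gaussian_mixture(x, 0.0))   # a plain Gaussian
BETA_MIXTURE = estimate_beta(lambda x: log_gaussian_mixture(x, 1.5))  # bimodal mixture
```

With $\mu = 0$ the mixture is a single Gaussian and the estimate is zero up to rounding. With $\mu = 1.5$ the density is bimodal, hence not log-concave, yet the estimated deficiency stays a modest constant, in line with the $O(1)$-log-concave example in the text.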
Theorem 10 Let $\beta \ge 0$ be a sufficiently small constant. Assume that $D$ is an isotropic $\beta$ log-concave distribution
in $\mathbb{R}^d$. Then arbitrary $f \in C$ can be learned with respect to $D$ in polynomial time in the active learning model from
$O((d + \log(1/\delta) + \log\log(1/\epsilon))\log(1/\epsilon))$ labeled examples, and in the passive learning model from $O\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$
examples.

To prove Theorem 10 we provide several new properties for such distributions. We start by stating a couple of technical
lemmas. (Proofs are in Appendices A and B.)

Theorem 11 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be the density function of a log-concave distribution centered at $z$ and with covariance
matrix $A = E_f[(X - z)(X - z)^T]$. Assume $f$ satisfies $\|z\| \le \xi$ and $1/C \le \int (u \cdot x)^2 f(x)\,dx \le C$ for every unit
vector $u$, for $C \ge 1$ constant close to 1. We have: (a) Assume $(1/20 + \xi)/\sqrt{1/C - \xi^2} \le 1/9$. Then there exists a
universal constant $c$ such that $f(x) \ge c$ for all $x$ with $0 \le \|x\| \le 1/20$. (b) Assume $C \le 1 + 1/5$. There exist
universal constants $c_1$ and $c_2$ such that $f(x) \le c_1\exp(-c_2\|x\|)$ for all $x$.

Theorem 12 Let $f : \mathbb{R} \to \mathbb{R}$ be the density function of a log-concave distribution centered at $\xi$ with standard
deviation $\sigma = \sqrt{\mathrm{Var}_f(X)}$. Then $f(x) \le 1/\sigma$ for all $x$. If furthermore $f$ satisfies $1/C \le E_f[X^2] \le C$ for $C \ge 1$ and
$\xi/\sqrt{1/C - \xi^2} \le 1/9$, then we have $f(0) \ge c$ for some universal constant $c$.

We next show that for any isotropic $\beta$ log-concave density $f$ there exists a log-concave density $\tilde{f}$ whose center is
within $e(C-1)\sqrt{Cd}$ of $f$'s center and that satisfies $f(x)/C \le \tilde{f}(x) \le Cf(x)$, for $C$ as small as $e^{\beta\lceil\log_2(d+1)\rceil}$. The fact that
$C$ depends only exponentially on $\log d$ (as opposed to exponentially on $d$) is key for being able to argue that such
distributions have light tails.

Lemma 13 For any isotropic $\beta$ log-concave density function $f$ there exists a log-concave density function $\tilde{f}$ that
satisfies $f(x)/C \le \tilde{f}(x) \le Cf(x)$ and $\left\|\int x(f(x) - \tilde{f}(x))\,dx\right\| \le e(C-1)\sqrt{Cd}$, for $C = e^{\beta\lceil\log_2(d+1)\rceil}$. Moreover,
we have $1/C \le \int (u \cdot x)^2 \tilde{f}(x)\,dx \le C$ for every unit vector $u$.

Proof: Note that if the density function $f$ is $\beta$ log-concave we have that $h = \ln f$ satisfies that for any $\lambda \in [0, 1]$,
$x_1 \in \mathbb{R}^n$, $x_2 \in \mathbb{R}^n$, we have $h(\lambda x_1 + (1-\lambda)x_2) \ge -\beta + \lambda h(x_1) + (1-\lambda)h(x_2)$.

Let $\hat{h}$ be the function whose subgraph is the convex hull of the subgraph of $h$. That is, $\hat{h}(x)$ is the maximum of all
values of $\sum_{i=1}^{k} \alpha_i h(u_i)$ for any $u_1, \ldots, u_k \in \mathbb{R}^d$ and $\alpha_1, \ldots, \alpha_k \in [0, 1]$ such that $\sum_{i=1}^{k} \alpha_i = 1$ and $x = \sum_{i=1}^{k} \alpha_i u_i$.
Note that, if the components of $u_i$ are $u_{i,1}, \ldots, u_{i,d}$, we can get $\hat{h}(x)$ by starting with

\[ T = \{(u_{1,1}, \ldots, u_{1,d}, h(u_1)), \ldots, (u_{k,1}, \ldots, u_{k,d}, h(u_k))\}, \]

taking the convex combination of the members of $T$ with mixing coefficients $\alpha_1, \ldots, \alpha_k$, and then reading off the last
component. Carathéodory's theorem^4 implies that we can get the same result using a mixture of at most $d+1$ members
of $T$. In other words, we can assume without loss of generality that $k = d + 1$, so that

\[ \hat{h}(x) = \max_{\sum_i \alpha_i = 1,\ x = \sum_i \alpha_i u_i}\ \sum_{i=1}^{d+1} \alpha_i h(u_i). \tag{9} \]

Because of the case where $(\alpha_1, \ldots, \alpha_{d+1})$ concentrates all its weight on one component, we have $h(x) \le \hat{h}(x)$.
We also claim that

\[ h(x) \ge \hat{h}(x) - \beta\lceil\log_2(d+1)\rceil. \tag{10} \]



4 Carathéodory's theorem states that if a point $x$ of $\mathbb{R}^d$ lies in the convex hull of a set $P$, then there is a subset $P'$ of $P$ consisting of $d + 1$ or
fewer points such that $x$ lies in the convex hull of $P'$.
We will prove this by induction on $\log_2(d + 1)$, treating the case in which $d + 1$ is a power of 2. (By padding with
zeroes if necessary, we may assume without loss of generality that $d + 1$ is a power of 2.)

The base case, in which $d = 1$, follows immediately from the definitions. Let $k = d + 1$. Assume that $x =
\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k$, with $\sum_i \alpha_i = 1$ and $\alpha_i \ge 0$. We can write this as:

\[ x = (\alpha_1 + \alpha_2)x_{1,2} + (\alpha_3 + \alpha_4)x_{3,4} + \cdots + (\alpha_{k-1} + \alpha_k)x_{k-1,k}, \]

where $x_{i,i+1} = \frac{\alpha_i}{\alpha_i + \alpha_{i+1}}x_i + \frac{\alpha_{i+1}}{\alpha_i + \alpha_{i+1}}x_{i+1}$, for all $i$. Now, by induction we have:

\[ h(x) \ge -\beta\log_2(k/2) + (\alpha_1 + \alpha_2)h(x_{1,2}) + \cdots + (\alpha_{k-1} + \alpha_k)h(x_{k-1,k}) \]

\[ \ge -\beta\log_2(k/2) - (\alpha_1 + \alpha_2)\beta + \alpha_1 h(x_1) + \alpha_2 h(x_2) - (\alpha_3 + \alpha_4)\beta + \alpha_3 h(x_3) + \alpha_4 h(x_4) + \cdots - (\alpha_{k-1} + \alpha_k)\beta + \alpha_{k-1}h(x_{k-1}) + \alpha_k h(x_k) \]

\[ = -\beta\log_2(k) + \alpha_1 h(x_1) + \alpha_2 h(x_2) + \alpha_3 h(x_3) + \cdots + \alpha_k h(x_k). \]

The last step follows from the fact that $\sum_{i=1}^{k} \alpha_i = 1$.

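So, we have proved (10). If we further normalize $e^{\hat{h}}$ to make it a density function, we obtain $\tilde{f}$ that is log-concave and
satisfies $f(x)/C \le \tilde{f}(x) \le Cf(x)$, where $C = e^{\beta\lceil\log_2(d+1)\rceil}$. This implies that for any $x$ we have $|f(x) - \tilde{f}(x)| \le
(C - 1)\tilde{f}(x)$.

We now show that the center of $\tilde{f}$ is close to the center of $f$. We have:

\[ \left\|\int x(f(x) - \tilde{f}(x))\,dx\right\| \le \int \|x\|\,|f(x) - \tilde{f}(x)|\,dx \le (C-1)\int \|x\|\tilde{f}(x)\,dx = (C-1)\int_{r=0}^{\infty} P_{\tilde{f}}[\|X\| \ge r]\,dr. \]

Using concentration properties of $\tilde{f}$ (in particular Lemma 2) we get

\[ \left\|\int x(f(x) - \tilde{f}(x))\,dx\right\| \le (C-1)\int_{r=0}^{\infty} e^{-r/\sqrt{Cd}+1}\,dr = e(C-1)\sqrt{Cd}, \]

as desired. ∎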



Theorem 14 Assume $\beta$ is a sufficiently small non-negative constant and let $\mathcal{D}$ be the set of all $\beta$ log-concave distributions.

(a) $\mathcal{D}$ is admissible.

(b) Any $D \in \mathcal{D}$ has light tails. That is: $P(\|X\| > R\sqrt{Cd}) < Ce^{-R+1}$, for $C = e^{\beta\lceil\log_2(d+1)\rceil}$.

Proof: (a) Choose $D \in \mathcal{D}$. As in Lemma 3, consider the plane determined by $u$ and $v$ and let $\mathrm{proj}_{u,v}(x)$ denote the
projection operator that, given $x \in \mathbb{R}^d$, orthogonally projects $x$ onto this plane. If $D_2 = \mathrm{proj}_{u,v}(D)$ then $d_D(u, v) =
d_{D_2}(u', v')$. By using the Prékopa-Leindler inequality [25] one can show that $D_2$ is $\beta$ log-concave (see e.g., [13]).
Moreover, if $D$ is isotropic, then $D_2$ is isotropic as well. By Lemma 13 we know that there exists a $C$-isotropic log-concave
distribution $\tilde{D}_2$ centered at $z$, $\|z\| \le \xi$, whose density $\tilde{f}_2$ satisfies $f_2(x)/C \le \tilde{f}_2(x) \le Cf_2(x)$ and $1/C \le \int (u \cdot x)^2\tilde{f}_2(x)\,dx \le
C$ for every unit vector $u$, for constants $C = e^{\beta\lceil\log_2 3\rceil}$ and $\xi = e(C-1)\sqrt{2C}$. For $\beta$ sufficiently small we have $(1/20 +
\xi)/\sqrt{1/C - \xi^2} \le 1/9$. By Theorem 11 we have $\tilde{f}_2(x) \ge c$ for $\|x\| \le 1/20$, which implies $f_2(x) \ge c/C$ for
$\|x\| \le 1/20$. Using reasoning as in Lemma 3 we get the desired result. The density bounds in the $n = 1$ case follow
from Theorem 12.

The generalization of Theorem 4 follows the same proof, except using Theorems 11 and 12.

(b) Since $X$ is isotropic, we have $E_f[X \cdot X] = d$ (where $f$ is its associated density). By Lemma 13 there exists a
log-concave density $\tilde{f}$ such that $f(x)/C \le \tilde{f}(x) \le Cf(x)$, for $C = e^{\beta\lceil\log_2(d+1)\rceil}$. This implies $E_{\tilde{f}}[X \cdot X] \le Cd$. By
Lemma 2 we get that under $\tilde{f}$, $P(\|X\| > R\sqrt{Cd}) < e^{-R+1}$, so under $f$ we have $P(\|X\| > R\sqrt{Cd}) < Ce^{-R+1}$.
∎

Applying Theorem 8 completes the proof of Theorem 10.



6.2 More covariance matrices 

In this section, we extend Theorem 5 to the case of arbitrary covariance matrices.

Theorem 15 If all distributions in $\mathcal{D}$ are zero-mean and log-concave in $\mathbb{R}^d$, then arbitrary $f \in C$ can be learned in
polynomial time from arbitrary distributions in $\mathcal{D}$ in the active learning model from $O((d + \log(1/\delta) + \log\log(1/\epsilon))\log(1/\epsilon))$
labeled examples, and in the passive learning model from $O\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$ examples.

Our proof is through a series of lemmas. First, Lovász and Vempala [38] have shown how to reduce to the nearly
isotropic case.

Lemma 16 ([38]) For any constant $\kappa > 0$, there is a polynomial time algorithm that, given polynomially many
samples from a log-concave distribution $D$, outputs an estimate $\hat{\Sigma}$ of the covariance matrix of $D$ such that, with
probability $1 - \delta$, the distribution $D'$ obtained by sampling $x$ from $D$ and producing $\hat{\Sigma}^{-1/2}x$ has $\frac{1}{1+\kappa} \le E((u \cdot x)^2) \le
1 + \kappa$ for all unit vectors $u$.

As a result of Lemma 16, we can assume without loss of generality that the distribution $D$ satisfies $\frac{1}{1+\kappa} \le E((u \cdot x)^2) \le
1 + \kappa$ for an arbitrarily small constant $\kappa$. By Theorem 11 this implies that, without loss of generality, there are constants
$c_1, \ldots, c_4$ such that, for the density $f$ of $D$, we have

\[ f(x) \ge c_1 \text{ for all } x \text{ with } \|x\| \le c_2, \tag{11} \]

and for all $x$,

\[ f(x) \le c_3\exp(-c_4\|x\|). \tag{12} \]

We will show that these imply that $\mathcal{D}$ is admissible.

Lemma 17 (a) There exists $c$ such that for any two unit vectors $u$ and $v$ in $\mathbb{R}^d$ we have $c\theta(v, u) \le d_D(u, v)$.

(b) For any $c_6 > 0$, there is a $c_7 > 0$ such that the following holds. Let $u$ and $v$ be two unit vectors in $\mathbb{R}^d$, and assume
that $\theta(u, v) = \alpha < \pi/2$. Then

\[ P_{x \sim D}[\mathrm{sign}(u \cdot x) \ne \mathrm{sign}(v \cdot x),\ |v \cdot x| \ge c_7\alpha] \le c_6\alpha. \]

Proof: (a) Projecting $D$ onto a subspace can only reduce the norm of its mean, and its variance in any direction.
Therefore, as in the proof of Lemma 3, we may assume without loss of generality that $d = 2$. Here, let us define
$A$ to be the region of disagreement between $u'$ and $v'$ intersected with the ball $B_{c_2}$ of radius $c_2$ in $\mathbb{R}^2$. Then we
have $d_{D_2}(u', v') \ge \mathrm{vol}(A)\inf_{x \in A} D_2(x) \ge \mathrm{vol}(B_{c_2})c_1\theta(u, v)$.

(b) This proof basically amounts to observing that everything that was needed for the proof of Theorem 4 is true for $D$, because of (11) and (12). ∎

Armed with Lemma 17, to prove Theorem 15 we can just apply Theorem 8.
7 Lower Bounds

In this section we give lower bounds on the label complexity of passive and active learning of homogeneous linear
separators when the underlying distribution is $\beta$ log-concave, for a sufficiently small constant $\beta$. These lower bounds
are information theoretic, applying to any procedure, including ones that are not computationally efficient.

Our key lemma is a lower bound on the packing numbers $M_D(C, \epsilon)$. Recall that the $\epsilon$-packing number, $M_D(C, \epsilon)$,
is the maximal cardinality of an $\epsilon$-separated set of classifiers from $C$, where we say that $w_1, \ldots, w_N$ are $\epsilon$-separated
w.r.t. $D$ if $d_D(w_i, w_j) > \epsilon$ for any $i \ne j$. We have:

Lemma 18 There is a positive constant $c$ such that, for all $\beta < c$, the following holds. Assume that $D$ is $\beta$ log-concave
in $\mathbb{R}^d$, and that its covariance matrix has full rank. For all sufficiently small $\epsilon$ and all $d \in \mathbb{N}$, we have

\[ M_D(C, \epsilon) \ge \frac{\sqrt{d}}{2}\left(\frac{c}{2\epsilon}\right)^{d-1} - 1. \]

Proof: We first prove the lemma in the case that $D$ is isotropic. The proof in this case follows the outline of a proof
for the special case of the uniform distribution in [36].

Let $\mathrm{UBALL}_d$ be the uniform distribution on the surface of the unit ball in $\mathbb{R}^d$. By Theorem 14, there exists $c$ such that
for any two unit vectors $u$ and $v$ in $\mathbb{R}^d$ we have $c\theta(v, u) \le d_D(u, v)$. This implies that for a fixed $u$ the probability
that a randomly chosen $v$ has $d_D(u, v) \le \epsilon$ is upper bounded by the volume of those vectors in the interior of the unit
ball whose angle is at most $\epsilon/c$ divided by the volume of the unit ball. Using known bounds on this ratio (see [36]) we

haveP 1 , eUB ALL d [dij(M,w) < e] < ^= (^) , so P u ,„ eU BALL d [di)(u, v) < e] < ^= (^) ~ . That means that for 
a fixed s if we pick s normal vectors at random from the unit ball, then the expected number of pairs of half-spaces 
that are e-close according to D is at most ^= (2s) . Removing one element of each pair from S yields a set of 

s— 1 halfspaces that are e-separated. Setting s — ^e/c^- 1 1 l ea ds the desired result. 

To handle the non-isotropic case, suppose that $\Sigma$ is the covariance matrix of $D$, so that $\Sigma^{-1/2}$ is the whitening transform.
Let $D'$ be the whitened version of $D$, i.e. the distribution obtained by first choosing $x$ from $D$, and then producing
$\Sigma^{-1/2}x$. We have $d_D(v, w) = d_{D'}(v\Sigma^{1/2}, w\Sigma^{1/2})$ (because $\mathrm{sign}(v \cdot x) \ne \mathrm{sign}(w \cdot x)$ iff $\mathrm{sign}((v\Sigma^{1/2}) \cdot (\Sigma^{-1/2}x)) \ne
\mathrm{sign}((w\Sigma^{1/2}) \cdot (\Sigma^{-1/2}x))$). So we can use an $\epsilon$-packing w.r.t. $D'$ to construct an $\epsilon$-packing of the same size w.r.t. $D$. ∎

Using Lemma 18 we get the following lower bound for passive supervised learning under isotropic log-concave
distributions.



Theorem 19 For a small enough constant $\beta$, for any $\beta$ log-concave distribution $D$ whose covariance matrix has full
rank, the sample complexity of learning origin-centered linear separators under $D$ in the passive learning model is
$\Omega\left(\frac{d}{\epsilon} + \frac{1}{\epsilon}\log\left(\frac{1}{\delta}\right)\right)$.



Proof: It is known 1 36 1 that, for any distribution D, the sample complexity of PAC learning origin-centered linear 
separators w.r.t. D is at least ^ M (c,2e) ^ Applying Lemma[T8l gives an Sl(d/e) lower bound. 

It is known [36] that, if for each $\epsilon$ there is a pair of classifiers $v$, $w$ such that $d_D(v, w) = \epsilon$, then the sample complexity
of PAC learning is $\Omega((1/\epsilon)\log(1/\delta))$; this requirement is satisfied by $D$. ∎

We also obtain a lower bound for active learning. 



Theorem 20 For a small enough constant $\beta$, the sample complexity of active learning of linear separators under $\beta$
log-concave distributions is $\Omega\left(d\log\left(\frac{1}{\epsilon}\right)\right)$.

Proof: As shown in [35], in order to output a hypothesis of error at most $\epsilon$ with probability at least $1 - \delta$, where
$\delta \le 1/2$, an active learning algorithm that is allowed to make arbitrary yes-no queries must make $\Omega(\log M_D(C, \epsilon))$
queries. Using this together with Lemma 18 we get the desired result. ∎
Note that, if the covariance matrix of $D$ does not have full rank, the number of dimensions is effectively less than $d$,
so our lower bound essentially applies for all log-concave distributions.



8 The inseparable case 

In this section, we extend some of our results to cases in which the data is not linearly separable.



8.1 Disagreement based learning 



In this section we compute two closely related distribution dependent capacity notions: the Alexander capacity and
the disagreement coefficient; they have been widely used for analyzing the label complexity of non-aggressive active
learning algorithms [26], [19], [33], [27], [33]. We begin with the definitions. For $r > 0$, define $B(w, r) = \{u \in C :
P_D(\mathrm{sign}(u \cdot x) \ne \mathrm{sign}(w \cdot x)) \le r\}$. For any $W \subseteq C$, define the region of disagreement as

\[ \mathrm{DIS}(W) = \{x \in X : \exists w, u \in W \text{ s.t. } \mathrm{sign}(u \cdot x) \ne \mathrm{sign}(w \cdot x)\}. \]

Define the Alexander capacity function $\mathrm{cap}_{w^*,D}(\cdot)$ for $w^* \in C$ w.r.t. $D$ as:

\[ \mathrm{cap}_{w^*,D}(r) = \frac{P_D(\mathrm{DIS}(B(w^*, r)))}{r}. \]

Define the disagreement coefficient for $w^* \in C$ w.r.t. $D$ as:

\[ \mathrm{dis}_{w^*,D}(\epsilon) = \sup_{r \ge \epsilon}\left[\mathrm{cap}_{w^*,D}(r)\right]. \]
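The definitions above can be illustrated with a toy computation. The sketch below tests membership in $\mathrm{DIS}(W)$ for a finite set of weight vectors and forms the corresponding empirical capacity ratio on a finite sample. Both simplifications are assumptions made for illustration: the true ball $B(w^*, r)$ is infinite and $P_D$ is a probability rather than a sample fraction. The convention $\mathrm{sign}(0) = +1$ is also a choice made here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(z: float) -> int:
    """Sign with the (assumed) convention sign(0) = +1."""
    return 1 if z >= 0 else -1


def in_disagreement_region(x, hypotheses) -> bool:
    """True if two hypotheses in the finite set label x differently."""
    return len({sign(dot(w, x)) for w in hypotheses}) > 1


def empirical_capacity(sample, hypotheses, r: float) -> float:
    """Fraction of sample points in DIS(hypotheses), divided by r.
    `hypotheses` stands in for a finite subset of B(w*, r)."""
    hits = sum(1 for x in sample if in_disagreement_region(x, hypotheses))
    return hits / len(sample) / r


W = [(1.0, 0.0), (0.0, 1.0)]
POINTS = [(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0), (-1.0, 1.0)]
```

In this toy setup, the two hypotheses disagree exactly on the second and fourth points, so half of the sample lies in the disagreement region.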



Theorem 21 Let $\beta \ge 0$ be a sufficiently small constant. Assume that $D$ is an isotropic $\beta$ log-concave distribution in
$\mathbb{R}^d$. For any $w^*$, for any $\epsilon$, $\mathrm{cap}_{w^*,D}(\epsilon)$ is $O(d^{1/2 + \frac{\beta}{2\ln 2}}\log(1/\epsilon))$. Thus $\mathrm{dis}_{w^*,D}(\epsilon) = O(d^{1/2 + \frac{\beta}{2\ln 2}}\log(1/\epsilon))$.

Proof: Roughly, we will show that almost all $x$ classified by a large enough margin by $w^*$ are not in $\mathrm{DIS}(B(w^*, r))$,
because all hypotheses agree with $w^*$ about how to classify such $x$, and therefore all pairs of hypotheses agree with
each other. Consider $w$ such that $d(w, w^*) \le r$; by Theorem 14 we have $\theta(w, w^*) \le cr$. Define $C = e^{\beta\lceil\log_2(d+1)\rceil}$ as
in the proof of Theorem 14. For any $x$ such that $\|x\| \le \sqrt{dC}\log(1/r)$ we have

\[ (w \cdot x - w^* \cdot x) \le \|w - w^*\| \times \|x\| \le cr\sqrt{dC}\log(1/r). \]

Thus, if $x$ also satisfies $|w^* \cdot x| > cr\sqrt{dC}\log(1/r)$ we have $(w^* \cdot x)(w \cdot x) > 0$. Since this is true for all $w$, any such
$x$ is not in $\mathrm{DIS}(B(w^*, r))$. By Theorem 14 we have, for a constant $c_2$, that

\[ P_{x \sim D}(|w^* \cdot x| \le cr\sqrt{Cd}\log(1/r)) \le c_2 r\sqrt{Cd}\log(1/r). \]

Moreover, by Theorem 14 we also have

\[ P_{x \sim D}[\|x\| > \sqrt{Cd}\log(1/r)] \le r. \]

These both imply $\mathrm{cap}_{w^*,D}(\epsilon) = O(C^{1/2}\sqrt{d}\log(1/\epsilon))$. ∎

Theorem 21 immediately leads to concrete bounds on the label complexity of several algorithms in the literature [26],
[16], [5], [33], [19]. For example, by composing it with a result of [19], we obtain a bound of $\tilde{O}(d^{3/2}(\log^2(1/\epsilon) + (\nu/\epsilon)^2))$
for agnostic active learning when $D$ is isotropic log-concave in $\mathbb{R}^d$; that is, we only need $\tilde{O}(d^{3/2}(\log^2(1/\epsilon) + (\nu/\epsilon)^2))$
label requests to output a classifier of error at most $\nu + \epsilon$, where $\nu = \min_{w \in C}\mathrm{err}(w)$.
8.2 The Tsybakov condition

The Tsybakov condition [39] is the assumption that the classifier $h$ that minimizes $P_{(x,y) \sim D_{XY}}(h(x) \ne y)$ is a linear
classifier, and that, for the weight vector $w^*$ of that optimal classifier, there exist known parameters $\alpha, a > 0$ such that,
for all $w$, we have

\[ a(d_D(w, w^*))^{1/(1-\alpha)} \le \mathrm{err}(w) - \mathrm{err}(w^*). \tag{13} \]

As in [6], we will use a different algorithm in the inseparable case (Algorithm 2).
Algorithm 2 Margin-based Active Learning (non-separable case)

Input: a sampling oracle for $D$, and a labeling oracle; a sequence of sample sizes $m_k > 0$, $k \in \mathbb{Z}^+$; a sequence of
cut-off values $b_k > 0$, $k \in \mathbb{Z}^+$; a sequence of hypothesis space radii $r_k > 0$, $k \in \mathbb{Z}^+$; a sequence of precision values
$\epsilon_k > 0$, $k \in \mathbb{Z}^+$
Output: weight vector $\hat{w}_s$.

• Pick random $\hat{w}_0$: $\|\hat{w}_0\|_2 = 1$.

• Draw $m_1$ examples from $D_X$, label them and put into $W$.

• iterate $k = 1, \ldots, s$

  - find $\hat{w}_k \in B(\hat{w}_{k-1}, r_k)$ ($\|\hat{w}_k\|_2 = 1$) to approximately minimize training error: $\sum_{(x,y) \in W} I(\hat{w}_k \cdot xy) \le
    \min_{w \in B(\hat{w}_{k-1}, r_k)}\sum_{(x,y) \in W} I(w \cdot xy) + m_k\epsilon_k$.

  - clear the working set $W$

  - until $m_{k+1}$ additional data points are labeled, draw sample $x$ from $D_X$

    * if $|\hat{w}_k \cdot x| \ge b_k$, reject $x$
    * otherwise, ask for label of $x$, and put into $W$

  end iterate
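The inner sampling step of Algorithm 2 (draw unlabeled points, reject those outside the band, request labels only inside it) can be sketched as follows. This covers only the filtering step, not the error-minimization step. The sampler, the `max_draws` safety cap, and all names are assumptions introduced for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gaussian_sampler(dim: int, seed: int):
    """Hypothetical sampling oracle: standard Gaussian points in R^dim."""
    rng = random.Random(seed)
    return lambda: [rng.gauss(0.0, 1.0) for _ in range(dim)]


def collect_band_points(w, b: float, m: int, draw, max_draws: int = 1_000_000):
    """Draw points until m of them satisfy |w . x| < b.

    Returns (kept, drawn): the points whose labels would be requested, and the
    total number of unlabeled draws, including the rejected ones."""
    kept, drawn = [], 0
    while len(kept) < m:
        if drawn >= max_draws:
            raise RuntimeError("band is too thin for the allowed number of draws")
        x = draw()
        drawn += 1
        if abs(dot(w, x)) >= b:
            continue  # outside the band: rejected without a label request
        kept.append(x)
    return kept, drawn
```

The gap between `drawn` and `len(kept)` is the quantity the unlabeled-sample analysis accounts for: labels are requested only for the kept points, while every draw counts toward the total number of examples.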



By generalizing Theorem 4 so that it provides a stronger bound for larger margins, and combining the result with the
other lemmas of this paper and techniques from [6], we get the following. See Appendix C for a proof.

Theorem 22 Assume that the joint distribution $D_{XY}$ satisfies the Tsybakov noise condition for constants $\alpha \in [0, 1)$
and $a > 0$, and that the marginal $D$ on $\mathbb{R}^d$ is isotropic log-concave.

There exist constants $C_1$, $C_2$ and $C_3$ s.t. for $d \ge 4$, and for any $\epsilon, \delta > 0$, $\epsilon < 1/4$, the following hold.

If a = 0, then Algorithm 2 with bk — §t, r k = 2~^ k ~ 2 \ nik = C2 (d + In 1+ ^~ fc ), after s = [log 2 (a/e)] iterations, 
finds a separator with excess error < e with probability 1 — 6. The total number of labeled examples needed is 
0(log(l/e))(d + log(s/<5)), and the total number of examples overall needed is O ^ d + 1 °s( 1 /' 5 ) ^ 

For a E (0,1), using Algorithm 2 with b k = Cl ^-l k) , r k = 2-( fe - 1 )( 1 - Q ), m k = ^ (d + In i± f^). e fc = 
2 ak (i+ak) ' men a ft er s — riog 2 (a/e)] iterations, we find a separator with excess error < e with probability 1 — 5. 
The total number of labeled examples needed is 0((l/e) 2a log 2 (l/e))(d + log(s/<5)). 

The case where $\alpha = 0$ is more general than the well-known Massart noise condition [40]. In this case, for active
learning, Theorem 22 improves over the previously best known results [28] by a (disagreement coefficient) $\mathrm{dis}_{w^*,D}(\epsilon)$
factor.

Clearly, Theorem 22 also implies a greedy-like passive learning algorithm, where we simply ignore the labeled examples
that fall within the margin $b_k$ of the current separator $\hat{w}_k$ and otherwise just follow the steps in the active learning
algorithm, and in the end output $\hat{w}_s$. This algorithm uses only $O\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$ labeled examples to output a classifier
of error at most $\epsilon$ with probability at least $1 - \delta$. This is optimal (up to constant factors) and it improves the previously
known best bound of Giné and Koltchinskii [21] by a $\log(\mathrm{cap}_{w^*,D}(\epsilon))$ factor. It is consistent with recent lower
bounds that include log(cap„,* £>( e )) B21 because those bounds are for a worst-case domain distribution, subject to a 
constraint on cap^, D {e). 

When $\alpha > 0$, the previously best result for active learning [28] is

\[ O\left((1/\epsilon)^{2\alpha}\,\mathrm{dis}_{w^*,D}(\epsilon)\left(d\log(\mathrm{dis}_{w^*,D}(\epsilon)) + \log(1/\delta)\right)\right). \]

Combining this with our new bound on $\mathrm{dis}_{w^*,D}(\epsilon)$ (Theorem 21) we get a bound of

\[ O\left((1/\epsilon)^{2\alpha}d^{3/2}\log(1/\epsilon)(\log(d) + \log\log(1/\epsilon)) + \log(1/\delta)\right) \]

for log-concave distributions. So our Theorem 22 saves roughly a factor of $\sqrt{d}$, at the expense of an extra $\log(1/\epsilon)$
factor.

Note that as opposed to the realizable case, all existing algorithms analyzed under these noise conditions (including
our algorithms in Theorem 22) are not known to run in polynomial time.

The results in this section can also be extended to nearly log-concave distributions by making use of our results in
Section 6.1.
A Proof of Theorem 11

Let $Y = A^{-1/2}(X - z)$. Then $Y$ is a log-concave distribution in the isotropic position. Moreover, the density function
$g$ of $Y$ is given by $g(y) = \det(A^{1/2})f(A^{1/2}y + z)$. Let $M = E[XX^T]$. We have

\[ A = E[(X - z)(X - z)^T] = E[XX^T] - zz^T = M - zz^T. \]

Also, the fact $1/C \le \int (u \cdot x)^2 f(x)\,dx \le C$ for every unit vector $u$ is equivalent to

\[ 1/C \le u^T E[XX^T]u \le C \]

for every unit vector $u$. Using $v = (1, 0)$, $v = (0, 1)$, and $v = (1/\sqrt{2}, 1/\sqrt{2})$ we get that $M_{1,1} \in [1/C, C]$,
$M_{2,2} \in [1/C, C]$, and $M_{1,2} = M_{2,1} \in [1/C - C, C - 1/C]$. We also have $\|z\| \le \xi$ and $\det(A^{1/2}) = \sqrt{\det(A)}$. All
these imply that

\[ \sqrt{(1/C - \xi^2)^2 - (C - 1/C)^2} \le \det(A^{1/2}) \le C. \]
(a) For $x = A^{1/2}y + z$ we have $\|x - z\|^2 = (x - z)^T(x - z) = \|y\|^2 v^TAv$, where $v = (1/\|y\|)y$ is a unit vector, so
$\|y\| \le \|x - z\|/\sqrt{1/C - \xi^2}$. If $\|x\| \le 1/20$ we have $\|y\| \le 1/9$, so by Lemma 2 we have $g(y) \ge c_1$, so $f(x) \ge c$,
for some universal constants $c_1$ and $c$, as desired.

(b) We have $f(x) = \frac{1}{\det(A^{1/2})}g(A^{-1/2}(x - z))$. By Lemma 2(b) we have $f(x) \le \frac{1}{\det(A^{1/2})}\exp\left[-c\|A^{-1/2}(x - z)\|\right]$.
By the triangle inequality we further obtain:

\[ f(x) \le \frac{1}{\det(A^{1/2})}\exp\left[c\|A^{-1/2}z\|\right]\exp\left[-c\|A^{-1/2}x\|\right]. \]

For $C \le 1 + 1/5$, we can show that $\|A^{-1/2}x\| \ge (1/\sqrt{2})\|x\|$. It is enough to show $\|A^{-1/2}x\|^2 \ge (1/2)\|x\|^2$, or
that $2\|v\|^2 \ge \|A^{1/2}v\|^2$, where $v = A^{-1/2}x$ (so $x = A^{1/2}v$). This is equivalent to $2v^Tv \ge v^TAv$, which is true since
the matrix $2I - A$ is positive semi-definite.



B Proof of Theorem 12

Define $Y = (X - z)/\sigma$. We have $E[Y] = 0$ and $E[Y^2] = 1$. The density $g$ of $Y$ is given by $g(y) = \sigma f(\sigma y + z)$. Now,
since $g$ is isotropic and log-concave, we can apply Lemma 2(e) to $g$. So $g(y) \le 1$ for all $y$. So, $\sigma f(\sigma y + z) \le 1$ for
all $y$, which implies $f(x) \le 1/\sigma$ for all $x$. The second part follows as in Theorem 11.
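The bound $f(x) \le 1/\sigma$ can be checked on a few familiar one-dimensional log-concave densities using their closed-form maxima and standard deviations. The table below is an illustrative assumption (the particular distributions and parameters are arbitrary choices); each entry is a pair (maximum of the density, standard deviation).

```python
import math

# (sup of density, standard deviation) for some log-concave densities.
EXAMPLES = {
    "gaussian(sigma=2)": (1.0 / (2.0 * math.sqrt(2.0 * math.pi)), 2.0),
    "exponential(rate=3)": (3.0, 1.0 / 3.0),
    "uniform[0,5]": (1.0 / 5.0, 5.0 / math.sqrt(12.0)),
    "laplace(scale=1.5)": (1.0 / (2.0 * 1.5), 1.5 * math.sqrt(2.0)),
}


def sup_times_sigma(name: str) -> float:
    """sup_x f(x) * sigma; the bound f(x) <= 1/sigma says this is at most 1."""
    sup_f, sigma = EXAMPLES[name]
    return sup_f * sigma


if __name__ == "__main__":
    for name in EXAMPLES:
        print(name, round(sup_times_sigma(name), 4))
```

The exponential density attains the bound with equality, so the constant 1 in $f(x) \le 1/\sigma$ cannot be improved in general.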



C Massart and Tsybakov noise 

In this section we analyze label complexity for active learning under the popular Massart and Tsybakov noise conditions,
proving Theorem 22.

C.1 Massart noise ($\alpha = 0$)

We start by analyzing Algorithm 2 in the case that $\alpha = 0$; the resulting assumption is more general than the well-known
Massart noise condition.

From the log-concavity assumption, the proof of Theorem 5, with slight modifications, proves that there exists $c$ such
that for all $w$ we have

\[ ca\theta(w, w^*) \le \mathrm{err}(w) - \mathrm{err}(w^*). \tag{14} \]

We prove by induction on $k$ that after $k \le s$ iterations, we have

\[ \mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le ca2^{-k} \]

with probability $1 - \frac{\delta}{2}\sum_{i<k}\frac{1}{(1+s-i)^2}$. The case $k = 1$ follows from classic bounds [49].

Assume now the claim is true for $k - 1$ ($k \ge 2$). Then at the $k$-th iteration, we can let $S_1 = \{x : |\hat{w}_{k-1} \cdot x| \le b_{k-1}\}$ and
$S_2 = \{x : |\hat{w}_{k-1} \cdot x| > b_{k-1}\}$. By the induction hypothesis, we know that with probability at least $1 - \frac{\delta}{2}\sum_{i<k-1}\frac{1}{(1+s-i)^2}$,
$\hat{w}_{k-1}$ has excess error at most $ca2^{-(k-1)}$, implying, using (14), that $\theta(\hat{w}_{k-1}, w^*) \le 2^{-(k-1)}$. By assumption,

\[ \theta(\hat{w}_{k-1}, \hat{w}_k) \le 2^{-(k-1)}. \]

From Theorem 4, recalling that $a$ is a constant, we have both:

\[ P((\hat{w}_{k-1} \cdot x)(\hat{w}_k \cdot x) < 0, x \in S_2) \le ca2^{-k}/4 \]

\[ P((\hat{w}_{k-1} \cdot x)(w^* \cdot x) < 0, x \in S_2) \le ca2^{-k}/4. \]
Taking the sum, we obtain:

\[ P((\hat{w}_k \cdot x)(w^* \cdot x) < 0, x \in S_2) \le ca2^{-k}/2. \tag{15} \]

Therefore:

\[ \mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le (\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1))P(S_1) + P((\hat{w}_k \cdot x)(w^* \cdot x) < 0, x \in S_2) \]

\[ \le (\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1))c_3 b_{k-1} + ca2^{-k}/2. \]

By standard Vapnik-Chervonenkis bounds, we can choose $C$ s.t. with $m_k$ samples, we obtain

\[ \mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1) \le ca2^{-k}/(c_3 b_{k-1}) \]

with probability $1 - (\delta/2)/(1+s-i)^2$. Therefore $\mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le ca2^{-k}$ with probability $1 - \frac{\delta}{2}\sum_{i<k}\frac{1}{(1+s-i)^2}$,
as desired.

The bound on the total number of examples, labeled and unlabeled, follows the same line of argument as Theorem 6,
except with the constants of this analysis.

C.2 Tsybakov noise ($\alpha > 0$)

We now treat the more general Tsybakov noise.

For this analysis, we need a generalization of Theorem 4 that provides a stronger bound on the probability of large-margin
errors, using a stronger assumption on the margin.

Theorem 23 There is a positive constant $c$ such that the following holds. Let $u$ and $v$ be two unit vectors in $\mathbb{R}^d$, and
assume that $\theta(u, v) = \eta < \pi/2$. Assume that $D$ is isotropic log-concave in $\mathbb{R}^d$. Then, for any $b \ge c\eta$, we have

\[ P_{x \sim D}[\mathrm{sign}(u \cdot x) \ne \mathrm{sign}(v \cdot x) \text{ and } |v \cdot x| \ge b] \le C_5\eta\exp(-C_6 b/\eta), \tag{16} \]

for absolute constants $C_5$ and $C_6$.

Proof: Arguing as in the proof of Lemma 3, we may assume without loss of generality that $d = 2$.

Next, we claim that each member $x$ of $E$ has $\|x\| \ge b/\eta$, where $E$ denotes the set of $x$ with $\mathrm{sign}(u \cdot x) \ne \mathrm{sign}(v \cdot x)$ and $|v \cdot x| \ge b$. Assume without loss of generality that $v \cdot x$ is positive. (The
other case is symmetric.) Then $u \cdot x < 0$, so the angle of $x$ with $u$ is obtuse, i.e. $\theta(x, u) \ge \pi/2$. Since $\theta(u, v) = \eta$,
this implies that

\[ \theta(x, v) \ge \pi/2 - \eta. \tag{17} \]

But $x \cdot v \ge b$, and $v$ is unit length, so $\|x\|\cos\theta(x, v) \ge b$, which, using (17), implies $\|x\|\cos(\pi/2 - \eta) \ge b$, which,
since $\cos(\pi/2 - \eta) \le \eta$ for all $\eta \in [0, \pi/2]$, in turn implies $\|x\| \ge b/\eta$. This implies that, if $B(r)$ is a ball of radius $r$
in $\mathbb{R}^2$, that

\[ P[E] = \sum_{i=1}^{\infty} P[E \cap (B((i+1)(b/\eta)) - B(i(b/\eta)))]. \tag{18} \]
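The geometric claim $\|x\| \ge b/\eta$ for every point on which $u$ and $v$ disagree while $|v \cdot x| \ge b$ can be checked on a grid in the plane. This is a sanity-check sketch only; the choice $u = (1, 0)$, the grid extent, and the resolution are assumptions made for the demo.

```python
import math


def min_norm_of_large_margin_disagreements(eta: float, b: float,
                                           grid: int = 60, extent: float = 6.0) -> float:
    """Smallest norm among grid points x with sign(u.x) != sign(v.x) and |v.x| >= b,
    where u = (1, 0) and v is u rotated by the angle eta. Returns inf if none found."""
    u = (1.0, 0.0)
    v = (math.cos(eta), math.sin(eta))
    best = math.inf
    for i in range(grid + 1):
        for j in range(grid + 1):
            x = (-extent + 2 * extent * i / grid, -extent + 2 * extent * j / grid)
            ux = u[0] * x[0] + u[1] * x[1]
            vx = v[0] * x[0] + v[1] * x[1]
            if ux * vx < 0 and abs(vx) >= b:
                best = min(best, math.hypot(x[0], x[1]))
    return best


ETA, B = 0.3, 0.2
SMALLEST = min_norm_of_large_margin_disagreements(ETA, B)
```

The disagreement region is a thin double wedge of angular width $\eta$, so requiring a margin of $b$ pushes its points out to radius at least $b/\sin\eta \ge b/\eta$, which is what lets the proof decompose $E$ into annuli starting at radius $b/\eta$.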

Let us bound one of the terms in the RHS. Choose $i \ge 1$.
Let $f(x_1, x_2)$ be the density of $D$. We have

\[ P[E \cap (B((i+1)(b/\eta)) - B(i(b/\eta)))] = \iint_{(x_1,x_2) \in B((i+1)(b/\eta)) - B(i(b/\eta))} 1_E(x_1, x_2)f(x_1, x_2)\,dx_1dx_2. \]

Let $R_i = B((i+1)(b/\eta)) - B(i(b/\eta))$. Applying the density upper bound from Lemma 2 with $d = 2$, there are
constants $C_1$ and $C_2$ such that

\[ P[E \cap (B((i+1)(b/\eta)) - B(i(b/\eta)))] \le \iint_{(x_1,x_2) \in R_i} 1_E(x_1, x_2)C_1\exp(-(b/\eta)C_2 i)\,dx_1dx_2 \]

\[ = C_1\exp(-(b/\eta)C_2 i)\iint_{(x_1,x_2) \in R_i} 1_E(x_1, x_2)\,dx_1dx_2. \]

If we include $B(i(b/\eta))$ in the integral again, we get

\[ P[E \cap (B((i+1)(b/\eta)) - B(i(b/\eta)))] \le C_1\exp(-(b/\eta)C_2 i)\iint_{(x_1,x_2) \in B((i+1)(b/\eta))} 1_E(x_1, x_2)\,dx_1dx_2. \]

Now, we exploit the fact that the integral above is a rescaling of a probability with respect to the uniform distribution.
Let $C_3$ be the volume of the unit ball in $\mathbb{R}^2$. Then, we have

\[ P[E \cap (B((i+1)(b/\eta)) - B(i(b/\eta)))] \le C_1\exp(-(b/\eta)C_2 i)\,C_3(i+1)^2(b/\eta)^2\eta/\pi \]

\[ = C_4(b/\eta)^2\eta(i+1)^2\exp(-(b/\eta)C_2 i), \]

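for $C_4 = C_1C_3/\pi$. Returning to (18), we get

\[ P[E] \le \sum_{i=1}^{\infty} C_4(b/\eta)^2\eta(i+1)^2\exp(-(b/\eta)C_2 i) = C_4(b/\eta)^2\eta\sum_{i=1}^{\infty}(i+1)^2\exp(-(b/\eta)C_2 i) \]

\[ = C_4(b/\eta)^2 \times \frac{4e^{2(b/\eta)C_2} - 3e^{(b/\eta)C_2} + 1}{\left(e^{(b/\eta)C_2} - 1\right)^3} \times \eta. \]

Now, if $b/\eta \ge 4/C_2$, we have

\[ P[E] \le C_4(b/\eta)^2 \times \frac{5e^{2(b/\eta)C_2}}{\left(e^{(b/\eta)C_2}/2\right)^3} \times \eta \]

\[ \le C_5\eta \times (b/\eta)^2\exp(-(b/\eta)C_2) \quad (\text{where } C_5 = 40C_4) \]

\[ = C_5\eta \times \exp(-(b/\eta)C_2 + 2\ln(b/\eta)) \]

\[ \le C_5\eta \times \exp(-(b/\eta)C_2/2), \]

completing the proof. ∎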
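The series step in the proof of Theorem 23 uses $\sum_{i \ge 1}(i+1)^2e^{-ai}$ with $a = (b/\eta)C_2$. The sketch below compares a truncated sum with the closed form $(4e^{2a} - 3e^{a} + 1)/(e^{a} - 1)^3$ and with the cruder bound $40e^{-a}$ used when $a \ge 4$. The closed form follows from the standard identity $\sum_{n \ge 0}(n+1)^2x^n = (1+x)/(1-x)^3$ with $x = e^{-a}$; the truncation length and function names are assumptions.

```python
import math


def truncated_series(a: float, terms: int = 400) -> float:
    """Partial sum of sum_{i>=1} (i+1)^2 * exp(-a*i)."""
    return sum((i + 1) ** 2 * math.exp(-a * i) for i in range(1, terms + 1))


def closed_form(a: float) -> float:
    """(4 e^{2a} - 3 e^{a} + 1) / (e^{a} - 1)^3, valid for a > 0."""
    e = math.exp(a)
    return (4.0 * e * e - 3.0 * e + 1.0) / (e - 1.0) ** 3


def crude_bound(a: float) -> float:
    """5 e^{2a} / (e^{a}/2)^3 = 40 e^{-a}, the bound used in the proof for a >= 4."""
    return 40.0 * math.exp(-a)
```

The crude bound only needs $e^{a} - 1 \ge e^{a}/2$ and $4e^{2a} - 3e^{a} + 1 \le 5e^{2a}$, both of which hold comfortably once $a \ge 4$.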
Now we are ready to prove Theorem 22 in the case that $\alpha > 0$.

Under the noise condition (13) and from the log-concavity assumption, we obtain that there exists $c$ such that for all $w$
we have:

\[ ac^{1/(1-\alpha)}\theta(w, w^*)^{1/(1-\alpha)} \le \mathrm{err}(w) - \mathrm{err}(w^*). \]

Let us denote $\tilde{c} = ac^{1/(1-\alpha)}$. For all $w$, we have:

\[ \tilde{c}\,\theta(w, w^*)^{1/(1-\alpha)} \le \mathrm{err}(w) - \mathrm{err}(w^*). \tag{19} \]
We prove by induction on $k$ that after $k \le s$ iterations, we have

\[ \mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le \tilde{c}2^{-k} \]

with probability $1 - \frac{\delta}{2}\sum_{i<k}\frac{1}{(1+s-i)^2}$. The case $k = 1$ follows from classic bounds.

Assume now the claim is true for $k - 1$ ($k \ge 2$). Then at the $k$-th iteration, we can let $S_1 = \{x : |\hat{w}_{k-1} \cdot x| \le
b_{k-1}\}$ and $S_2 = \{x : |\hat{w}_{k-1} \cdot x| > b_{k-1}\}$. By the induction hypothesis, we know that with probability at least
$1 - \frac{\delta}{2}\sum_{i<k-1}\frac{1}{(1+s-i)^2}$, $\hat{w}_{k-1}$ has excess error at most $\tilde{c}2^{-(k-1)}$, implying, using (19), that

\[ \theta(\hat{w}_{k-1}, w^*) \le 2^{-(k-1)(1-\alpha)}. \]

By assumption, $\theta(\hat{w}_{k-1}, \hat{w}_k) \le 2^{-(k-1)(1-\alpha)}$.
Applying Theorem 23, we have both:

\[ P((\hat{w}_{k-1} \cdot x)(\hat{w}_k \cdot x) < 0, x \in S_2) \le \tilde{c}2^{-k}/4 \]

\[ P((\hat{w}_{k-1} \cdot x)(w^* \cdot x) < 0, x \in S_2) \le \tilde{c}2^{-k}/4. \]

Taking the sum, we obtain:

\[ P((\hat{w}_k \cdot x)(w^* \cdot x) < 0, x \in S_2) \le \tilde{c}2^{-k}/2. \tag{20} \]

Therefore:

\[ \mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le (\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1))P(S_1) + P((\hat{w}_k \cdot x)(w^* \cdot x) < 0, x \in S_2) \]

\[ \le (\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1))b_k + \tilde{c}2^{-k}/2. \]

By standard bounds, we can choose $C_1$, $C_2$ and $C_3$ s.t. with $m_k$ samples, we obtain $\mathrm{err}(\hat{w}_k|S_1) - \mathrm{err}(w^*|S_1) \le
\epsilon_k$ with probability $1 - (\delta/2)/(1+s-i)^2$. Therefore $\mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) \le \tilde{c}2^{-k}$ with probability $1 -
\frac{\delta}{2}\sum_{i<k}\frac{1}{(1+s-i)^2}$, as desired, completing the proof of Theorem 22.